\documentclass[a4paper,preprint,12pt]{elsarticle}
\usepackage[T1]{fontenc}
\usepackage[english]{babel}
\usepackage{amsmath}
\usepackage{caption}
\usepackage[latin1]{inputenc} 
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{url}
\usepackage{tikz-qtree}
\usepackage{pdfpages}
\usepackage[pdftex, plainpages = false, pdfpagelabels,
                pdfpagelayout = useoutlines,
                 bookmarks,
                 bookmarksopen = true,
                 bookmarksnumbered = true,
                 breaklinks = true,
                 linktocpage,
                 pagebackref,
                 colorlinks = true,
                 linkcolor = blue,
                 urlcolor  = blue,
                 citecolor = blue,
                 anchorcolor = green,
                 hyperindex = true,
                 hyperfigures]{hyperref}

\begin{document}
\title{Experimental Evidence \\ in Electricity Behavior Research}
\author{Alexander L. Davis\corref{cor1}}
\author{Tamar Krishnamurti} 
\author{Baruch Fischhoff}
\author{Jay Apt} 

%\author{Denise Caruso} 
%\author{Daniel Schwartz}  
%\author{Jack Wang} 
%\author{Casey Canfield} 

\address{Department of Engineering and Public Policy \\ Carnegie Mellon University \\ 5000 Forbes Avenue, Pittsburgh, PA 15213}

\cortext[cor1]{Corresponding Author. Phone: 1-412-216-2040. Email: alexander [dot] l [l] davis1 [at] gmail.com}

\begin{abstract}  
We propose a practical and achievable ``gold standard'' for electricity field trials.  We discuss its content and rationale, providing justifications for why each element is important, then detail how field research looking at electricity consumption can be designed so that the standard is met.  We present a detailed example of a hypothetical in-home display field trial to demonstrate feasibility, along with a discussion of the costs and benefits of adhering to the standard.  We conclude with arguments for why federally funded research on human behavior and electricity consumption should be required to meet this standard, just as the FDA requires this high standard of evidence for drug approval.   
\end{abstract}

\begin{keyword}
Field Trials \sep Research Methods \sep Reporting \sep Electricity Behavior
\end{keyword}

\renewcommand{\topfraction}{1.0}

\maketitle
\section{Introduction}
Interventions that aim to reduce residential electricity consumption (e.g., providing information about electricity use on a custom in-home display) can yield small but important reductions in overall residential electricity demand \cite{davis2012setting}.  Customers benefit in the form of lower electricity bills, and utilities benefit by reducing ``incremental capacity, transmission, and distribution investments'' \cite{faruqui2010impact}.  Perhaps most importantly, reducing demand curtails rapidly increasing carbon emissions, contributing to the goal of limiting the global temperature rise to $2^{\circ}$ Celsius in the next century.

Without reliable evidence of effectiveness, money will be wasted on interventions that either fail upon implementation or limp on without any real understanding of their usefulness \cite{tunis2003practical}.  Well-designed experiments produce reliable evidence by unambiguously establishing both the magnitude of an intervention effect and whether it is causal.  Poorly designed ones often yield conflicting results that require repeated (and costly) data collection to sort out the facts.  Because poorly designed experiments will likely suffer from common flaws and, as a result, common bias, the aggregate knowledge produced is not much better than what is learned from a single study.  Policy makers should be reluctant to accept such evidence, as a single flawed study can easily support the incorrect conclusion that an ineffective treatment is effective, or vice versa.  Even when the safety or efficacy of a new treatment appears obvious to the utility, such as the implementation of smart meters, a single scientifically rigorous study could quell potential customer backlash.  A single good trial, carefully designed to minimize bias, can provide invaluable accurate evidence, resulting in diminished long-term costs to the utility from continued ``pilot'' studies and  the implementation of only those technologies and programs that most benefit customers. 

%\cite{gleeson1994blue} \footnote{\url{http://www.bcbs.com/blueresources/tec/tec-assessments.html}} \footnote{\url{http://grants.nih.gov/grants/guide/rfa-files/RFA-DK-01-024.html}} \cite{green1984using}

The FDA routinely makes decisions about whether to accept evidence of the effectiveness of a new drug or medical device in the form of phase I--III clinical trials. These trials are structured similarly to trials on electricity consumption behavior, so the FDA's standard of evidence serves as a useful guideline for the electricity industry to adopt.  At first, the focus and outcomes of FDA trials, such as cancer treatment and lives lost, may seem incommensurable with the outcomes of electricity field trials. However, it is precisely because the outcomes of medical clinical trials are so consequential that the necessary time and thought have gone into creating a rigorous, but also easily translatable, approach. The FDA's evaluations ultimately hinge on having two ``gold standard'' double-blind, placebo-controlled, randomized trials showing effectiveness. We show how this same approach can be applied to electricity field trials to gather the right evidence at an ultimately lower long-term cost. 

%\cite{friedman2010fundamentals}

In this paper we argue that the FDA's standard of evidence should be adopted by all electricity industry decision makers, from public utility commissions to the Department of Energy, when deciding whether to invest in new technologies or implement behavioral interventions.  We discuss the standard of evidence the FDA accepts, giving examples of problems and solutions from biomedical research, and then make the case for applying this standard to research on electricity behavior, using an in-home display field trial as an example.  Finally, we provide a simple checklist for designing research that meets the gold standard.

%\cite{gawande2010checklist}

\section{The Gold Standard}
In this section we discuss the important elements of clinical trials that are relevant to interventions that affect electricity consumption, drawing on government standards (e.g., the International Conference on Harmonization\footnote{\url{http://www.ich.org/products/guidelines/efficacy/article/efficacy-guidelines.html}} and the NCI\footnote{\url{http://www.cancer.gov/clinicaltrials}}), texts on the topic \cite{friedman2010fundamentals,meinert2012clinicaltrials}, and quality standards (e.g., GRADE \cite{guyatt2008grade}).

%\cite{mcjoynt2009building,goldman2003incremental,mills2006barriers,wendler2005racial,embi2005effect,townsley2005systematic,lind1772treatise}.
  %,\url{http://prevention.cancer.gov/clinicaltrials/}} NIH,\footnote{\url{http://health.nih.gov/topic/ClinicalTrials}}, FDA,\footnote{\url{http://www.fda.gov/ScienceResearch/SpecialTopics/RunningClinicalTrials/ucm114928.htm}} and the US Preventive Services Task Force\footnote{\url{http://www.uspreventiveservicestaskforce.org/}}
  
Specifically, we propose that a type of experimental approach used in clinical trials, called a \emph{pragmatic} or \emph{effectiveness} randomized controlled trial (RCT), matches the needs of electricity research in terms of both cost and practical application \cite{roland1998understanding}.  Pragmatic trials use large representative samples with minimal exclusions \cite{tunis2003practical}, often comparing established treatments against each other to find the best one, such as massage against acupuncture for treatment of lower back pain \cite{cherkin2001randomized}.  Pragmatic trials are used in biomedical research in circumstances that resemble those faced in electricity research, where diagnoses are clear (e.g., a need to cut peak load), conditions are common (e.g., load patterns are predictable and apply to most customers), practical advice on best interventions is needed (e.g., which pricing program or technology is most effective), and interventions are easily implemented (i.e., they don't require technical experts to implement).  

%,emanuel2000makes,simoons1993international,garg1996rationale
%\cite{roland1998understanding}

We acknowledge that it may not always be easy to implement pragmatic RCTs. Using heterogeneous groups and a representative sample means that the trial requires a large sample size and, as such, may cost more than a less rigorous trial.  The large sample size also makes quality control and the use of very precise measurements difficult, as precise procedures and quality assurance do not scale well.  However, even in the face of these limitations, there is rarely a case in which non-random, non-controlled alternatives (currently the most common approach to electricity intervention testing) are a better option than an RCT for gathering adequate evidence.

%,byar1976randomized,spodick1982randomized,london2007clinical

Pragmatic trials can provide the highest quality evidence on the effectiveness of a new technology or intervention, but the evidence produced is only as good as the study design. In the next few sections we discuss the elements of the study design that are necessary to conduct a successful RCT.  

%Even when equipoise is credible, participants must be informed that they may not receive the treatment by being told their probability of receiving no treatment or placebo so they can assess their own equipoise when making their enrollment decision \cite{friedman2010fundamentals}.
\section{Design}
\subsection{Background, Rationale, and Systematic Review}
The design of an RCT begins by establishing the background and rationale of the study.  This is facilitated by formulating study objectives into testable hypotheses, specifying measurable benefits and risks.  

% \cite{friedman2010fundamentals}

A systematic review then looks at published and unpublished scientific research on the topic.  Theoretically plausible and empirically supported mechanisms of action may be discovered from the review, allowing the study design to be refined so that it can detect these mechanisms.  The review also includes a quantitative meta-analysis that allows calculation of the expected effect size, power, and required sample size of the trial, taking into account the methodological quality of the data \cite{turner2009bias}.

%\footnote{\url{http://effectivehealthcare.ahrq.gov/ehc/products/60/318/MethodsGuide_Prepublication-Draft_20120523.pdf}} ,guyatt2011grade,viswanathan2012development
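As a rough illustration of the sample size calculation, consider a sketch that assumes a two-arm trial comparing mean consumption, equal group sizes, and a normal approximation.  The symbols below are introduced here for illustration, and the numerical inputs are hypothetical rather than taken from any review.  Let $\delta$ be the expected difference in means, $\sigma$ the standard deviation of the outcome, $\alpha$ the two-sided significance level, and $1-\beta$ the desired power.  The required number of participants per group is then approximately
\begin{equation*}
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}},
\end{equation*}
where $z_{q}$ is the $q$th quantile of the standard normal distribution.  With $\alpha = .05$ and power of $.80$, $z_{.975} \approx 1.96$ and $z_{.80} \approx 0.84$, so a standardized effect of $\delta/\sigma = 0.1$ would call for roughly $2 \times 7.85 / 0.01 \approx 1570$ participants per group.  Because $n$ grows with $1/\delta^{2}$, halving the expected effect quadruples the required sample.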

\subsection{Internal Validity}
A study's internal validity is the degree to which it can be used to make causal inferences.  This concept formalizes the intuitive notion of causality, which holds that an intervention is causal if it is not itself affected by uncontrolled variables, affects the outcome (and not vice versa), and works as intended \cite{scheines2005similarity}.  In the past twenty years, causality has been precisely defined in the mathematical formalism of graph theory \cite{pearl2000causality,spirtes2000causation}.  To establish internal validity, RCTs use a concurrent control group, randomization, and blinding.  

%\cite{hernan2004structural,robins2000marginal}

\subsubsection{Control Group}
To demonstrate causality, one must simultaneously show that the same participant benefits when given the treatment and does not benefit when not given the treatment.  Unfortunately this is impossible, as the same participant cannot simultaneously receive and not receive the treatment.  One commonly used approach in the electricity industry is to apply statistical methods to, for example, predict the person's future behavior from past behavior.  This approach is unlikely to work, as no statistical model can account for unforeseeable changes in circumstances or unique historical events that may make the future quite unlike the past.

An RCT instead uses a concurrent control group: some participants receive the treatment, while a separate group of participants, sampled in the same way from the same population during the same time period, receives no treatment.  As each participant is equally likely to be in the control or treatment group, in the long run the control group should accurately predict what those in the treatment group would have done had they not received the treatment.
%\footnote{\url{http://www.ich.org/products/guidelines/efficacy/article/efficacy-guidelines.html}}
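The logic of the concurrent control group can be written compactly in potential-outcomes notation.  This is one common way to express it; the notation is introduced here for illustration and is not drawn from the cited sources.  Let $Y_i(1)$ and $Y_i(0)$ denote participant $i$'s electricity consumption with and without the treatment, and let $T_i \in \{0,1\}$ indicate the group to which the participant is assigned.  The individual effect $Y_i(1) - Y_i(0)$ can never be observed, because only one of the two outcomes is realized.  When $T_i$ is assigned independently of the potential outcomes, however,
\begin{equation*}
E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0] = E[Y_i(1)] - E[Y_i(0)],
\end{equation*}
so the difference between the observed group means, $\bar{Y}_{T=1} - \bar{Y}_{T=0}$, is an unbiased estimate of the average treatment effect.  A forecast based on a participant's own past behavior carries no such guarantee unless the model fully captures how the period would have unfolded without the treatment.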

The causal argument from an RCT is delicate, as causality is established only if the control and treatment groups are treated exactly alike.  If, on the other hand, those who receive the treatment know they received it, whereas those in the control group know they did not, then their beliefs will differ, introducing non-equivalence.  This is why placebo groups are used in RCTs: they ensure that participants cannot tell whether they received the treatment or the control, thereby equating the beliefs of those in each group.  For this reason, special care must be taken to make the placebo as similar to the treatment as possible.  For example, a pilot trial of dynamic pricing must notify participants in the placebo group of peak time hours, even while charging them a fixed rate per kWh.  Creating appropriate study recruitment language to temper expectations in the placebo group is challenging, but certainly not insurmountable. 

%\footnote{\url{http://www.ich.org/products/guidelines/efficacy/article/efficacy-guidelines.html}}

\subsubsection{Randomization}
The only way to make sure each participant is equally likely to be in the control or treatment group is to use a randomizing device, such as a pseudo-random number generator.  Adding unpredictability to group assignment prevents the participant and experimenter from determining which group the participant will be assigned to \cite{kunz1998unpredictability,byar1976randomized} and, in the long run, balances variables other than the treatment between the groups \cite{altman1990randomisation}.  Even with the best intentions, a researcher may, consciously or unconsciously, be tempted to place a larger household with the potential for greater benefit in an in-home display treatment group and a smaller single-person household in the control group.  Similarly, a participant may choose to sign up for a trial only during the period in which the most personally appealing offer is available. 

%fisher1926arrangement \footnote{When participants are naturally clustered, for example based on geographic location, then randomization across clusters rather than individuals may improve study quality by reducing contamination/treatment diffusion \cite{cornfield1978symposium}.}

Two critical aspects of implementing randomization are \emph{Sequence Generation} and \emph{Allocation Concealment}.  Sequence generation is the process by which randomization is done, for example using a pseudo-random number generator \cite{schulz2002generation}.  Random sequences are necessary for group assignment because they are unpredictable, preventing the participant and experimenter from determining the participant's group assignment.  Non-random methods, such as alternating assignment or assignment by birth date, allow the possibility that participants enroll in such a way that they receive the treatment of their choice. Using a simple randomization method, such as a pseudo-random number generator, is an easy way to avoid accidentally using non-random sequences that are perceived to be random \cite{tune1964response}.

%\cite{kunz1998unpredictability,byar1976randomized,fisher1926arrangement,altman1990randomisation}.\footnote{When participants are naturally clustered, for example based on geographic location, then randomization across clusters rather than individuals may improve study quality by reducing contamination/treatment diffusion \cite{cornfield1978symposium}.}

Allocation concealment prevents the researcher and participant from knowing the random sequence and thus whether the next participant will be assigned to the treatment or control group.  The allocation must be concealed from both the researcher and participant, as researchers have been known to decipher and subvert random assignment out of curiosity or the desire to ``help'' the participant \cite{schulz2002allocation,schulz1996randomised,schulz1995unbiased}.  

For example, suppose an in-home display trial is conducted where the researcher administering the trial has two possible boxes to mail to a participant, one with the letter A on it and one with the letter B on it. This researcher calls a central facility (for example, the in-home display vendor) and asks whether to give the participant box A or box B, not knowing which contains the fully functional in-home display and which contains the placebo device (details on what might comprise a placebo in-home display device can be found in section 4.2.3).  The central office then generates a random number, assigning box A if the number is odd and box B if the number is even. The assigned box (A or B) is then recorded by the central office, given to the participant, and collected at the end of the study to make sure it was correctly administered. If the vendor is in charge of pushing information to the in-home display, the researcher may never need to be unblinded to the participants' condition.  This procedure prevents the researcher from subverting the treatment assignment and does not allow the participant to decide whether to continue participation in the study with knowledge of which treatment she received.  Failing to adequately implement and report allocation concealment can severely undermine the study's internal validity, as there is evidence that studies that lack allocation concealment are consistently biased in favor of effectiveness \cite{chalmers2001comparing}.

%There are alternative group assignment methods to randomization, but they are sufficiently flawed that randomization is almost always superior.  One example of such a method is Zelen's approach, which randomizes all eligible participants but only those assigned to the treatment group, following those in the control group without recruiting or notifying them \cite{anbar1983relative,ellenberg1984randomization,zelen1990randomized}.  This approach is not a good choice because consent cannot be obtained from those in the control group, participants in the control group are recruited differently from those in the treatment group so selection bias may emerge, and blinding cannot be used because those in the treatment group know that they are receiving the treatment.  A second non-randomized group assignment method is matching, where specific characteristics of participants (e.g., age, gender) are statistically balanced between the control and treatment group.  However, matching requires that one knows beforehand all the factors that could make participants benefit or not, which is almost never the case.
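To give a sense of how simple randomization behaves, the following sketch assumes that each participant is assigned independently with probability $1/2$, as in the odd/even rule above; the trial size used is hypothetical.  If $n$ participants are randomized, the number $N_A$ who receive box A follows a binomial distribution with
\begin{equation*}
E[N_A] = \frac{n}{2}, \qquad \mathrm{SD}(N_A) = \frac{\sqrt{n}}{2}.
\end{equation*}
For $n = 400$, the expected split is $200$ per group with a standard deviation of $10$, so roughly $95\%$ of such trials would place between about $180$ and $220$ participants in each group.  Chance imbalances of this size are expected and are not evidence that the randomization failed; because $\mathrm{SD}(N_A)/n = 1/(2\sqrt{n})$, the relative imbalance shrinks as the sample grows.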

\subsubsection{Blinding}
Merely knowing that one is in the treatment rather than control group can change both the way a participant behaves and the way they are treated, making it paramount to hide the treatment assignment from participants and researchers.  If the hypothesis being tested is telegraphed, participants may consciously or unconsciously behave in a way consistent with that hypothesis \cite{orne1969demand,orne1962social}.  For example, if a participant is aware that they are in a study with a goal of consumption reduction, they may use less electricity than usual, regardless of the actual intervention.  

In one of the first blinded experiments, Benjamin Franklin, Antoine Lavoisier and others gave ``unmagnetized water'' to those who believed ``animal magnetism'' applied to water could cure illness.  Franklin and Lavoisier observed the same hysterical, and supposedly curative, response to both magnetized and unmagnetized water \cite{franklin2002report}.  Without a blinded experiment, the French government might have concluded that animal magnetism was an effective treatment.

For a trial to qualify as ``double-blind,'' participants and those who interact with the participants or the data must not know the treatment assignment of any participant \cite{haahr2006blinded}.  However, the definition, use, and implementation of ``blinding'' are so poorly agreed upon that some have proposed removing the term from experimental terminology \cite{miller2011blind}.  For example, medical studies report little or no information about what is meant by ``double blind'' in their trial reports \cite{haahr2006blinded}, and fewer than half of the studies that could use double blinding report doing so \cite{schulz1996blinding}.

Maintaining blinding is difficult because unpredictable events and haphazard experimental design can unblind participants or researchers \cite{sackett2007commentary}.  Contacts (e.g., peak time notification), visits (for equipment installation), and treatment adjustments may convey information about group assignment and thus unintentionally unblind or otherwise make groups non-equivalent.  Therefore, all actions taken with the treatment group are also appropriately mimicked with the placebo group.  

%boutron2006methods,boutron2005review,sackett2004turning, schoenberger1980randomized,


%\cite{bull1959historical,lilienfeld1982ceteris} Digitalis \cite{trial1997effect} \cite{intermittent1983intermittent} \cite{silverman1977lesson} Amberson introduced both random assignment and blinding to clinical trials in humans \cite{amberson1931clinical}.

\subsection{External Validity}
Internal validity is the degree to which causal inferences can be made about the intervention in the study sample.  However, RCTs aim to address causal effects in the population of interest, not just the sample, which requires an assessment of the study's external validity, or the degree to which the sample is representative of the population.  If the sample was taken randomly from the population, then the causal effect in the sample is an unbiased estimate of the causal effect in the population.  While the practical constraints on trials will inevitably produce an imperfect sample, minimizing problems with \emph{Exclusion Criteria}, \emph{Volunteer Adjustment}, and \emph{Withdrawal Prevention} should allow for a sample that is as close to truly random as possible. 

\subsubsection{Eligibility and Exclusion Criteria}
Pragmatic trials define the population of interest as all people who may potentially benefit from the intervention, attempting to impose minimal exclusion restrictions on this population.  Because including such a heterogeneous population in the study can increase variability in the data, the sample size needs to be larger to obtain adequate statistical power.  The advantage is that the sample obtained from this approach is representative of the population and thus externally valid \cite{tunis2003practical}, meaning recommendations for any member of the population can be made because they could have been in the sample \cite{guyatt1994users}.

Some exclusions are almost inevitably necessary, however.  For example, women would neither provide information on the effectiveness of a prostate cancer treatment, nor could they possibly benefit, and it would be unethical to expose them to an intervention that had any risk of adverse events without the potential to benefit them or provide scientific knowledge. Similarly, in an electricity intervention, it would not make sense to give an in-home display to the homeless.  Eligibility and exclusion criteria avoid this situation by specifying who can participate in the study and who cannot.

These exclusions should ideally be strongly justified, but in practice are often based on weak justifications.  A strong justification for exclusion may be that the participant is unable to consent, that the participant may be harmed by the treatment, or that the intervention may be confounded for that participant (e.g., by cointerventions such as being on both a time-of-use and flat-rate tariff).  Weak justifications are based on unmotivated socio-demographic or health factors, such as age, sex, IQ, or chronic condition \cite{van2007eligibility}. 


%\cite{sung2003central} \cite{fossaa2002selection} \cite{britton1999threats} \cite{juurlink2004rates} %\cite{veterans1970effects} \cite{freis1967effects} \cite{veterans1972effects}  %A log or registry of eligible participants and those that choose to enroll can be used \cite{pedersen1983norwegian}  coronary artery surgery study: a randomized trial of coronary artery bypass surgery. Comparability of entry characteristics  1976 + wilhelmsen + a comparison between participants and non participants in a primary prevention trial  \cite{barter2007effects} \cite{karlowski1975ascorbic}

\subsubsection{Volunteer Adjustment}

Although pragmatic trials use minimal exclusions, the decision to participate in the trial is ultimately up to the participant.  In most trials those who participate are volunteers who may systematically differ from those who choose not to volunteer \cite{tunis2003practical,smith1990mortality}.  For example, those with higher education, higher socioeconomic status, and women are more likely to volunteer to participate in research studies than those who are less educated, have lower SES, or are men \cite{rosenthal1975volunteer}.

Recruiting participants into trials is difficult, with some trials able to recruit only 2--3\% of those invited to participate \cite{wright2002factors}.  The success of recruitment depends on a number of factors, such as whether the recruiter is comfortable discussing the uncertainty in risks and benefits of trial participation \cite{wright2002factors}, whether the participant perceives their role in the study as being a ``guinea-pig'' \cite{bevan1993patients}, trust in the researcher, and time constraints \cite{cox2003patients}.

Best practices for recruitment based on systematic reviews are available \cite{treweek2010strategies} and consistently show the effectiveness of using an ``opt-out'' design, where participants are assumed to want to participate in the study unless they explicitly refuse; merely failing to respond is not treated as a refusal.  This is supported by evidence showing that those who fail to respond to recruitment do so not because they are refusing to participate, but for other reasons, such as not getting around to it or being misinformed about the study \cite{williams2007no,junghans2005recruiting}.

The RCT takes measures to maximize recruitment and to adjust for the volunteer bias that remains when people refuse to participate.  One approach to adjustment is to use propensity score methods that model each person's probability or ``propensity'' to volunteer \cite{wooldridge2009introductory,gelman2007data}.  Such a model may be developed from psychodemographic variables collected on the recruited population.  If the model is accurate, then statistically controlling for the propensity to volunteer allows one to make valid inferences from sample to population.

%Training and evaluating recruiters can also dramatically increase recruitment success \cite{donovan2009development}.  Recruitment contacts, such as phone calls or in-home visits are done at the convenience of potential participants, for example during the evening or on the weekend.  Participants are given adequate time to contemplate their decision and discuss it with their family, and are not armtwisted, as this may increase subsequent withdrawal \cite{fitzpatrick2006recruitment}.  Pilot studies are used to estimate recruitment rates and possibly change recruitment strategy.
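One way such an adjustment could be carried out is inverse-propensity weighting, sketched here under assumptions that are not spelled out above: that volunteering is unrelated to the outcome once the measured covariates are taken into account, and that every type of customer has some chance of volunteering.  The symbols are introduced for illustration.  Let $V_i \in \{0,1\}$ indicate whether recruited customer $i$ volunteers, let $X_i$ be the covariates collected at recruitment, and let $e(X_i) = \Pr(V_i = 1 \mid X_i)$ be the propensity to volunteer.  A weighted estimate of the population mean of an outcome $Y$ is
\begin{equation*}
\hat{\mu} = \frac{\sum_{i:\,V_i = 1} Y_i / \hat{e}(X_i)}{\sum_{i:\,V_i = 1} 1 / \hat{e}(X_i)},
\end{equation*}
where the sums run over the volunteers in a given study arm and $\hat{e}(X_i)$ is the fitted propensity, for example from a logistic regression.  Volunteers who resemble customers that rarely volunteer receive larger weights, and the difference between the weighted means of the treatment and control groups estimates the effect in the recruited population.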

%\cite{bower2005patient} \cite{mills2006barriers} \cite{ross1999barriers}
%Phone calls from a trained researcher can improve knowledge but may not improve recruitment rates \cite{aaronson1996telephone}.  
%\cite{albrecht1999strategic} \cite{sharp2006reasons} \cite{lovato1997recruitment} \cite{mcdonald2006influences} \cite{treweek2010strategies} \cite{watson2006increasing} \cite{hunninghake1987summary} \cite{hunninghake1987summary}  
%\cite{van2011volunteers}.

\subsubsection{Withdrawal Prevention}

Retaining participants in a study is just as difficult as recruiting them \cite{koog2012barriers}.  Some of those who agree to participate in the study may not remain in the study until it is complete.  If those who are not benefiting from the treatment are the ones who withdraw, then average treatment effects observed in the study will be biased toward showing a greater benefit than there really is.

%,sangi2009attrition,ye2011data,ulmer2008usefulness,severi2011two,hunt1998retaining
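The size of this bias can be made explicit with a short decomposition; the notation is introduced here for illustration.  Suppose a proportion $p$ of the treatment group completes the study, and let $\mu_C$ and $\mu_W$ be the mean benefit (for example, the reduction in kWh) among those who complete and those who withdraw, so that the mean for everyone assigned to treatment is $\mu = p\,\mu_C + (1-p)\,\mu_W$.  An analysis restricted to completers then differs from the true mean by
\begin{equation*}
\mu_C - \mu = (1-p)\,(\mu_C - \mu_W),
\end{equation*}
which is positive whenever completers benefit more than those who withdraw.  The bias vanishes only if no one withdraws ($p = 1$) or if those who withdraw would have benefited as much as those who stay ($\mu_C = \mu_W$).  As a hypothetical example, if $80\%$ of participants complete the study and completers save $5\%$ while those who withdraw would have saved $1\%$, the completers-only estimate of $5\%$ exceeds the true average of $4.2\%$ by $0.8$ percentage points.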

One approach to accommodate withdrawals is to measure primary outcome variables shortly after baseline to capture critical information before withdrawals occur \cite{jordhoy1999challenges}. For example, a critical peak pricing program should plan to call several critical peak days soon after the study begins.  Another approach uses appropriate exclusion criteria during recruitment that accurately predict whether people will be able to complete the study.  

Categories or ``themes'' of retention strategies include community involvement, study identity, training personnel, study description, contact and scheduling methods, reminders, visit characteristics, study benefits, financial incentives, reimbursement, non-financial incentives, and tracking methods \cite{robinson2007systematic}.  Using multiple strategies across multiple retention themes may be effective.

%There is also some quantitative evidence that incentives increase retention, with higher incentives leading to greater retention \cite{booker2011systematic}.  There are exceptions, however, as novel payments (e.g., a \$2 bill \cite{doody2003randomized}) may be more effective than larger ones (e.g., \$5).  

%\cite{weissert1980cost}  \cite{cox2000enhancing} One study of home care had a 66\% attrition rate, most losses due to death \cite{mccorkle1989randomized}.  
%Preventing withdrawals is difficult \cite{addington1992randomised,kane1984randomised}, as participants may die, become ineligible, or change their mind and refuse to participate after initially agreeing \cite{mor1988day}.

\subsection{Statistical Validity}
RCTs stick to basic statistical analyses, for example looking at treatment effects on the whole population, before turning to more complex models and comparisons, such as preplanned subgroup analyses.  Once participants are in the study and randomly assigned to a condition, they are analyzed according to that assignment regardless of whether they adhered to the treatment regimen or dropped out of the study, an approach known as intent-to-treat analysis \cite{hollis1999meant}.  Alternative approaches that attempt to ``guess'' at missing data, called imputation \cite{ibrahim2005missing}, can complement the intent-to-treat analysis.

%\footnote{On a sidenote, subgroups have been surprisingly similar in response to treatment \cite{califf2002principles,trialists1994collaborative,ace1998indications,flather2000long}.} viswanathan2012interventions, ,schulz1996blinding sapirstein1994role,little1987statistical,

Careful attention must be paid to the timing and choice of statistical analyses, as experts on clinical trials report that failing to appropriately conduct and report statistics is the most common form of trial misconduct \cite{al2005effect}.  A representative example is the case where one statistical test produces a p-value greater than .05, a different test (with different assumptions) produces one lower than .05, while covariate adjustments increase or decrease the p-value in arbitrary ways.  When a situation like this arises, unblinded analysts are likely to report only the approaches that produced statistically significant ($p<.05$) results \cite{leamer1983let,fischhoff1982those,simmons2011false}.

%\footnote{Others include changing endpoints \cite{julian2006data,braunwald2004angiotensin,anturane1982anturane}.}  

One way to maintain the validity of statistical analyses is to register the trial\footnote{\url{http://clinicaltrials.gov/}, \url{http://www.who.int/ictrp/en/}} and publish the protocol of the study ahead of time, including the plan for statistical analyses, making clear which statistical tests are planned and which are post-hoc.  This protocol includes the statistical methods that will be used, the endpoints that will be compared, the target sample size \cite{freiman1978importance}, a statistical power analysis \cite{cohen1992power}, and clear rules for early stopping \cite{simon1989optimal}.  Blind data analysis can complement protocol publication, but researchers must be wary of pressure that may be put on blinded analysts if results do not come out as predicted \cite{gotzsche1996blinding}.

%\footnote{The journal Trials would is a good venue for protocol publishing \url{http://www.trialsjournal.com/}} cohen1962statistical,rossi1990statistical,sedlmeier1989studies, \footnote{\url{http://biometry.nci.nih.gov/power2/main.html}} ,fleming1982one \footnote{\url{http://jco.ascopubs.org/content/9/12/2225.abstract}}

\subsection{Reporting}
The Consolidated Standards of Reporting Trials (CONSORT) is a reporting standard that includes all critical information necessary to evaluate the validity of a clinical trial \cite{schulz2010consort,moher2010consort}.\footnote{The CONSORT checklist: \url{http://www.consort-statement.org/index.aspx?o=2965}}  It covers elements that are usually reported, such as the hypotheses of the study, as well as those that are frequently overlooked, such as changes to the protocol after beginning the trial, eligibility and exclusion criteria, sample size determination, the method of sequence generation, how blinding was done, how similar the treatment and placebo were, the allocation schedule and code-breaking, and an evaluation of the success of the blinding.
%(Baruch's paper) \cite{fischhoff2012communicating} \cite{fischhoff2012good} \cite{schulz1996blinding,altman2004turning,bang2004assessment,hemila2005assessment,sackett2007commentary,hrobjartsson2007blinded,kaptchuk1998intentional}.\footnote{\url{http://squire-statement.org/guidelines}}

Along with meeting the requirements of the CONSORT statement, careful attention is paid to selective reporting.  Selective reporting occurs when a measure or hypothesis that is not statistically significant is not included in a published report \cite{turner2008selective}.  Selective reporting is common and can undermine the validity of the study report, making treatments look effective when they are not, as all of the negative data are omitted from publication \cite{de2006normal,martinson2005scientists,goldacre2012bad,yong2012bad}.

%,heres2006olanzapine,steinbrook2005gag,psaty2003stopping

To avoid selective reporting, RCTs commit to data sharing and reproducibility.  Data sharing means providing data to others so they can review the data, check for errors, and reuse it for secondary analyses.  This can be done in a variety of ways, including using online databases such as Harvard's Dataverse.\footnote{\url{http://thedata.org/}}  Ideally, enough information will be provided about the methods, design, and statistical analysis so that the trial can be replicated and the statistical analyses verified by independent third parties \cite{stodden2009enabling}. In those instances in which data may be commercially sensitive (a realistic challenge with most utilities), measures can be taken to anonymize individual data and even the specifics of treatment conditions, so that independent replication of the analyses is still possible. For example, participants can be assigned a numeric identifier and treatment and control conditions can simply be labeled ``A'' and ``B,'' allowing subsequent researchers to determine whether there are outcome differences between the two groups without knowing any greater detail about them.  To facilitate reproducibility of the statistical analyses, the code can be shared, for example using a combination of the open-source tools R, \LaTeX, and Sweave \cite{leisch2002sweave}.  To facilitate reproducibility of the trial itself, the complexities of the study can be reported in open lab notebooks, such as OpenWetWare.\footnote{\url{http://openwetware.org/wiki/Main_Page}}

%\footnote{\url{http://academiccommons.columbia.edu/item/ac:154852}}

\section{An In-Home Display Trial Example}
In this section we provide an example of a hypothetical but highly realistic in-home display trial. This trial incorporates elements of trials that have been conducted in the field, making it both practical and achievable. However, it has been modified to meet the standards the FDA would accept. It is reported according to the CONSORT statement.\footnote{\url{http://www.consort-statement.org/consort-statement/}}

\subsection{Introduction and Systematic Review}
Researchers have extensively studied three approaches to reducing demand-side electricity consumption: dynamic pricing, in-home displays, and automated control systems, such as smart thermostats.  A meta-analysis aggregating thirty-two studies on the topic, weighting each study by its precision and adjusting for plausible methodological bias \cite{davis2012setting}, suggested that in-home displays were the most promising approach to reducing overall consumption, whereas dynamic pricing and smart thermostats were effective for reducing peak but not overall demand.  None of the studies on in-home displays were randomized, double-blind, placebo-controlled trials, so the present trial was conducted to definitively test the effectiveness of a custom in-home display in reducing overall electricity consumption among residential customers.

\subsubsection{Plausible Mechanism}
The literature review included the systematic meta-analysis and additional sources, finding that in-home displays likely reduce electricity consumption by promoting awareness of electricity use, where people realize that they are consuming ``an invisible product that is often ignored'' \cite{schembri2008influence}.  The first source of support for this mechanism comes from the finding that over the last forty years more sophisticated in-home displays (e.g., real-time feedback, graphical displays) have not been associated with larger effect sizes.  This suggests that it is ``the presence of the information itself---not its presentation in a more salient, graphical format---that is causing the behavior change'' \cite{allen2006effects}.  There is evidence that in-home displays do increase awareness, as Norton \emph{et al.} \cite{norton2008powercost} found that 75\% of participants reported being more aware of potential energy efficient actions after interacting with the PowerCost Monitor display.  Other evidence shows the association between the in-home display, changes in awareness, and changes in consumption.  For example, Hutton \emph{et al.} \cite{hutton1986effects} found the largest effects of the ECI display, above and beyond education alone, among participants in California, who knew the least about electricity consumption, as opposed to two Canadian cities where participants knew more.  Similarly, Yun \emph{et al.} \cite{yun2009investigating} found that participants given an in-home display who had low or moderate awareness of energy consumption at baseline reduced their energy consumption more than those who initially had high awareness.  This mechanism is not assured, however, as consciousness of problematic electricity use, and of the behaviors that contribute to it, is likely to be a necessary but not sufficient condition for behavioral change \cite{fischer2008feedback}.

%\cite{bandura2001social,thogersen2010electricity,bandura2001social,norman2002design,mcclelland1979energy,yun2009investigating,eiden2009investigation,seligman1978behavioral,dobson1992conservation,yun2009investigating,fischer2008feedback,seligman1978behavioral,yun2009investigating,yun2009investigating,paetz2011shifting,yun2009investigating,paetz2011shifting,hutton1986effects,mckenzie2011fostering,roberts2004consumer,steg2009encouraging,neenan2009residential,van1983patterns,wood2007energy}

\subsubsection{Objectives}
The primary hypothesis of the trial was that residential customers who were given in-home displays would reduce their overall electricity consumption more than those given a placebo display.  

The secondary hypothesis was that participants who report becoming more aware of their electricity use, based on an electricity knowledge test and subjective scale, would benefit from the in-home display, whereas those who did not would not benefit.

\subsection{Methods}
\subsubsection{Trial Design}
To test the primary and secondary hypotheses we conducted a sixteen-month, double-blind, randomized, concurrent controlled trial comparing an in-home display treatment group against a placebo group.  No changes were made to methods or eligibility criteria after the trial commenced.

\subsubsection{Participants}
The sample frame for the study included all residential customers in a geographic location (e.g., Pennsylvania) who had a smart meter linked to their home or apartment.  This inclusion criterion was necessary so that information could be communicated from the smart meter to the in-home display in real time.  The only other eligibility criterion was that participants must not have indicated that they were likely to move out of their home or apartment during the sixteen-month study period.

\subsubsection{Interventions}
Both the functional in-home display and the placebo provided energy-saving tips along with weather, temperature, and date/time information.  However, only the functional in-home display provided real-time electricity use information.

\subsubsection{Outcomes}
The primary outcome was monthly electricity use for each participant, computed from hourly smart-meter data.  The secondary outcome was a self-reported awareness measure taken at baseline (before randomization) and again at close-out (at the end of the study).  Participants were asked to list, in an open-ended format, all the factors that consume electricity in their household.  There were no changes to trial outcomes after the trial commenced.

\subsubsection{Sample Size and Power}
To estimate statistical power and choose the study sample size, we used the adjusted HLM estimate of the effect size from Davis \emph{et al.} \cite{davis2012setting}, which had a Cohen's $d$ of 0.63 \cite{cohen1962statistical}.  Based on this expected effect size, a simple two-sample t-test comparing the average electricity consumption of the in-home display and placebo groups, each with 300 participants, would have over 90\% power to detect the effect by rejecting the null hypothesis.  Even with an effect size five times smaller than that suggested by prior evidence ($d \approx 0.13$), a sample size of 300 in each group would still have roughly a 34\% (two-sided test) to 46\% (one-sided test) chance of detecting the effect at $\alpha=.05$.  Thus, a sample size of 300 participants in each group provides a study that is very sensitive to effects in the range of plausible values based on prior data.
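
As a rough check on these figures, the power of the two-sample comparison can be approximated with a normal approximation; the notation below is introduced here only for illustration, and an exact noncentral-$t$ calculation would differ only slightly at this sample size.  With $n$ participants per group, standardized effect size $d$, and a two-sided test at level $\alpha$,
\begin{equation*}
\mathrm{power} \approx \Phi\left(d\sqrt{n/2} - z_{1-\alpha/2}\right),
\end{equation*}
where $\Phi$ is the standard normal distribution function.  For $n=300$, $\alpha=.05$ ($z_{.975}=1.96$), and $d=0.63$, we have $d\sqrt{n/2}\approx 7.72$ and the power is essentially 1.  For an effect five times smaller, $d=0.126$, we have $d\sqrt{n/2}\approx 1.54$, giving a power of about $\Phi(-0.42)\approx .34$ for the two-sided test and about $\Phi(-0.10)\approx .46$ for a one-sided test ($z_{.95}=1.645$).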

There were no planned interim analyses or stopping guidelines.

\subsubsection{Sequence Generation}
Once 600 participants had consented, they were immediately assigned to one of two groups by simple randomization, using the following procedure:
\begin{enumerate}
\item Every recruited customer in the sample was given a random number from 0 to 1 using the ``=rand()'' function in Excel 2010.
\item The random numbers were then sorted from smallest to largest.
\item The first 300 customers were assigned group label A.
\item The second 300 customers were assigned group label B.
\end{enumerate}
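
The same four steps can be scripted so that the allocation sequence is reproducible.  The chunk below is a minimal sketch in R rather than the Excel procedure described above; the seed value and the participant identifiers are hypothetical and chosen only for illustration.

<<randomization-sketch,eval=false,echo=true>>=
set.seed(20120401)                 # hypothetical seed, recorded so the sequence can be regenerated
ids   <- 1:600                     # hypothetical participant identifiers
u     <- runif(length(ids))        # step 1: one uniform(0,1) draw per participant
ord   <- order(u)                  # step 2: sort participants by their random number
group <- character(length(ids))
group[ord[1:300]]   <- "A"         # step 3: the 300 smallest draws receive label A
group[ord[301:600]] <- "B"         # step 4: the remaining 300 receive label B
table(group)                       # check that the two groups are of equal size
@

Recording the seed alongside the protocol would let a third party regenerate the same sequence, which is not possible with a volatile spreadsheet function.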

\subsubsection{Allocation Concealment}
After random assignment, the customer information and group labels were recorded on each participant's digital case report form.  A neutral third party, who, along with the in-home display vendor, was the only unblinded person, then generated a random number from 0 to 1 for each group label using the ``=rand()'' function in Excel.  The group label with the lower number was assigned to the treatment group, and the other label to the placebo group.  The neutral party then instructed the vendor to make two types of ``welcome packages.''  These packages had identical weight, center of gravity, and size, and differed only in whether they contained the active or the placebo in-home display and in whether they were labeled A or B.  The vendor then mailed the packages to the participants.  Both the researchers and the vendor knew only the customer name and the group label of the package that was sent.

\subsubsection{Implementation}
The random allocation sequence was generated by Jay Apt.  Participants were enrolled by Alexander Davis and Tamar Krishnamurti.  Participants were assigned to interventions by Alexander Davis and Tamar Krishnamurti.

\subsubsection{Blinding}
Once the in-home displays were received, the study investigators did not know whether each individual participant had an active or placebo display, and none of the personnel interacting with the participants or the data knew the group assignments. The recruitment document [see Appendix A] did not provide participants with an expectation of real-time feedback on the in-home display, minimizing the risk of frustrated expectations in the placebo group. The interventions were identical except for the uninformative group label on the outside of the welcome package. Blinding was broken before the manuscript was submitted for publication so that the statistical analyses could be labeled correctly.

\subsubsection{Statistical Methods}
Simple independent-sample t-tests compared the total electricity used in the in-home display treatment group and the placebo group.  The mediating role of awareness was tested by first regressing awareness on treatment group assignment using simple linear regression, and then regressing electricity consumption on treatment group assignment, controlling for awareness.  One additional analysis was performed: a propensity score model, which was a simple linear regression of electricity use on treatment group assignment, controlling for each participant's propensity to volunteer.
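
In equation form, and with symbols introduced here only for illustration, let $T_i$ indicate assignment to the treatment group, $A_i$ the awareness score, $Y_i$ total electricity use, and $\hat{\pi}_i$ the estimated propensity to volunteer for participant $i$.  The mediation analysis can then be sketched as
\begin{align*}
A_i &= \alpha_0 + \alpha_1 T_i + e_i, \\
Y_i &= \gamma_0 + \gamma_1 T_i + \gamma_2 A_i + \epsilon_i,
\end{align*}
and the propensity-adjusted model as
\begin{equation*}
Y_i = \delta_0 + \delta_1 T_i + \delta_2 \hat{\pi}_i + \eta_i .
\end{equation*}
Under this sketch, the data are consistent with awareness mediating the effect if $\alpha_1 \neq 0$, $\gamma_2 \neq 0$, and $\gamma_1$ is closer to zero than the unadjusted treatment effect.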

\subsection{Results}
\subsubsection{Recruitment}
Recruitment began February 1, 2012 and ended April 1, 2012.  Follow-up lasted from May 1, 2012 to September 1, 2013.  The trial was ended September 1, 2013 as planned.

Based on recruitment rates reported in medical studies that used recruitment best practices, as well as experiences from other in-home display trials, we expected a 10\% recruitment rate, so that 6000 customers would need to be approached to enroll 600 participants.  The sample was selected from the set of residential customers who met the inclusion criteria (the sample frame) using the following randomization process:

\begin{enumerate}
\item Every customer in the sample frame was given a random number from 0 to 1 using the ``=rand()'' function in Excel 2010.
\item The random numbers were then sorted from smallest to largest.\footnote{Note that we sorted the values of the rand function and not the rand function itself.}
\item The first 6000 customers in the sample frame with the lowest random numbers were included in the sample.
\end{enumerate}

Eligible customers included in the sample were mailed a recruitment document that was pre-tested for enrollment quality and constructed based on best practices \cite{treweek2010strategies,edwards2009methods}.  The recruitment document included information about the study, and provided customers three ways to enroll: by email, by returning an addressed and stamped postcard, or by calling the researchers on a 1-800 number.  Recruiters who answered phones or responded to emails were certified as able to handle participant questions and give accurate information about the study \cite{donovan2009development}.

The study employed an opt-out design.  One week after the first recruitment mailing, those who had not responded were sent a postcard reminding them to sign up and eliciting reasons for not responding.  Two weeks after the first mailing, all non-responders were contacted by phone to recruit them, inform them about the study, or understand their reasons for refusal.  Three weeks after the first mailing, all remaining non-responders were again contacted by phone.  Screening logs recording reasons for refusal and exclusion were maintained so that we could determine how the enrolled and refusing populations differed.  After three weeks the set of participants who had not opted out of the study was complete; if it did not include enough participants, the process was repeated with another random sample of 6000 customers.

Those who did not opt out were then contacted by phone to discuss the purpose of the trial, explain informed consent, and acquaint the participant with the study and its requirements.  A second call then answered any questions raised and reviewed the study requirements a second time.  Participants were told, under the requirements of informed consent, that they would be given an in-home display that would provide information along with electricity saving tips.  Those assigned to the treatment or placebo in-home display groups did not know whether they got the treatment in-home display or control in-home display.  During the second call the participants were asked to verbally confirm consent to the study, verified by an audio recording.

\subsubsection{Baseline Data}
<<demographics,results=hide,fig=false,echo=false>>=
age.t<-runif(300,18,68) 
age.p<-runif(300,18,68) 
gender.t<-rbinom(300,1,0.5)
gender.p<-rbinom(300,1,0.5)
options(digits=2,scipen=2)
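# Format a number to two decimals without the leading zero (0.25 -> ".25").
# Magnitudes that would print as .00 are floored at .01, so very small
# p-values appear as ".01"; values of 1 or more keep their integer digits.
ztrunc<-function(t){
  a<-pmax(abs(t),0.01)
  s<-sprintf("%.2f",a)
  s<-ifelse(substr(s,1,1)=="0",substr(s,2,nchar(s)),s)
  ifelse(t<0,paste0("-",s),s)
}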
@ 

Demographic characteristics are summarized in Table~\ref{tab:demo}.\footnote{All data, materials, and statistical analyses are available at Harvard's Dataverse \url{http://hdl.handle.net/1902.1/20271} V1 [Version]}  Participants in the treatment and placebo groups were balanced on both demographic factors.

\begin{table}[h]
  \centering
  \caption{Demographics}
  \label{tab:demo}
  \begin{tabular}{c c c c}
    Demographic & Treatment & Placebo & t-test (p-value) \\ \hline
    Age & \Sexpr{prettyNum(mean(age.t))} (\Sexpr{prettyNum(sd(age.t))}) & \Sexpr{prettyNum(mean(age.p))} (\Sexpr{prettyNum(sd(age.p))}) & \Sexpr{prettyNum(abs(t.test(age.t,age.p)$statistic))} (\Sexpr{ztrunc(t.test(age.t,age.p)$p.value)})   \\ 
    Gender (Male) & \Sexpr{ztrunc(mean(gender.t))} (\Sexpr{ztrunc(sd(gender.t))}) & \Sexpr{ztrunc(mean(gender.p))} (\Sexpr{ztrunc(sd(gender.p))}) & \Sexpr{prettyNum(abs(t.test(gender.t,gender.p)$statistic))} (\Sexpr{ztrunc(t.test(gender.t,gender.p)$p.value)})   \\  \hline    
\end{tabular}
\end{table}

\subsubsection{Numbers Analyzed}
Three hundred participants were initially assigned to each group, and all three hundred from each group were included in all statistical analyses according to their original group assignments. 

\subsubsection{Outcomes and Estimation}

<<outcomes,results=hide,echo=false,fig=false>>=
use1.t<-rnorm(300,1000-0.63*200,200)
use1.p<-rnorm(300,1000,200)
use2.t<-rnorm(300,1000-0.63*200,200)
use2.p<-rnorm(300,1000,200)
use3.t<-rnorm(300,1000-0.63*200,200)
use3.p<-rnorm(300,1000,200)
use4.t<-rnorm(300,1000-0.63*200,200)
use4.p<-rnorm(300,1000,200)
use5.t<-rnorm(300,1000-0.63*200,200)
use5.p<-rnorm(300,1000,200)
use6.t<-rnorm(300,1000-0.63*200,200)
use6.p<-rnorm(300,1000,200)
use7.t<-rnorm(300,1000-0.63*200,200)
use7.p<-rnorm(300,1000,200)
use8.t<-rnorm(300,1000-0.63*200,200)
use8.p<-rnorm(300,1000,200)
use9.t<-rnorm(300,1000-0.63*200,200)
use9.p<-rnorm(300,1000,200)
use10.t<-rnorm(300,1000-0.63*200,200)
use10.p<-rnorm(300,1000,200)
use11.t<-rnorm(300,1000-0.63*200,200)
use11.p<-rnorm(300,1000,200)
use12.t<-rnorm(300,1000-0.63*200,200)
use12.p<-rnorm(300,1000,200)
use13.t<-rnorm(300,1000-0.63*200,200)
use13.p<-rnorm(300,1000,200)
use14.t<-rnorm(300,1000-0.63*200,200)
use14.p<-rnorm(300,1000,200)
use15.t<-rnorm(300,1000-0.63*200,200)
use15.p<-rnorm(300,1000,200)
use16.t<-rnorm(300,1000-0.63*200,200)
use16.p<-rnorm(300,1000,200)
#aware.t<-runif(1,use.t+rnorm(1,0,1)
#aware.p<-
subj<-1:600
kwh.p<-matrix(data=NA,nrow=300,ncol=16)
for(i in 1:300){
kwh.p[i,]<-rnorm(16,1000,200)+rnorm(1,0,50)
}
kwh.t<-matrix(data=NA,nrow=300,ncol=16)
for(i in 1:300){
kwh.t[i,]<-rnorm(16,1000-0.63*200,200)+rnorm(1,0,50)
}
treat<-c(append(rep(0,300),rep(1,300)))
kwh<-rbind(kwh.p,kwh.t)
frame<-data.frame(subj,treat,kwh)
# Give each of the 16 study months a unique column name; repeating "may" to
# "august" would make reshape() reuse the 2012 columns for the 2013 months.
study.months<-c("may2012","june2012","july2012","august2012","september2012","october2012","november2012","december2012","january2013","february2013","march2013","april2013","may2013","june2013","july2013","august2013")
colnames(frame)<-c("subjectid","treatment",study.months)
#install.packages("arm")
#install.packages("reshape")
library(reshape)
library(arm)
# Reshape from one row per participant to one row per participant-month.
kwh.r<-reshape(frame,varying=study.months,v.names="kwh",timevar="month",direction="long",times=study.months,new.row.names=1:(16*600))
hlm<-lmer(kwh~treatment+(1|subjectid)+(1|month),data=kwh.r)
@ 

Table~\ref{tab:kwh} shows the average monthly kWh use (standard deviations in parentheses) for the treatment and placebo groups for all 16 months in the study.  As can be seen, the treatment group had significantly lower average kWh use than the placebo group in every month.

\begin{table}[h]
  \centering
  \caption{Average Monthly kWh Use for Treatment IHD and Placebo Group}
  \label{tab:kwh}
  \begin{tabular}{c c c c c}
    Month & Treatment & Placebo & t-test (p-value) & Cohen's d \\ \hline
    May, 2012 & \Sexpr{prettyNum(mean(use1.t))} (\Sexpr{prettyNum(sd(use1.t))}) & \Sexpr{prettyNum(mean(use1.p))} (\Sexpr{prettyNum(sd(use1.p))}) & \Sexpr{prettyNum(abs(t.test(use1.t,use1.p)$statistic))} (\Sexpr{ztrunc(t.test(use1.t,use1.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use1.t,use1.p)$statistic/sqrt(600)))} \\ 
    June, 2012 & \Sexpr{prettyNum(mean(use2.t))} (\Sexpr{prettyNum(sd(use2.t))}) & \Sexpr{prettyNum(mean(use2.p))} (\Sexpr{prettyNum(sd(use2.p))}) & \Sexpr{prettyNum(abs(t.test(use2.t,use2.p)$statistic))} (\Sexpr{ztrunc(t.test(use2.t,use2.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use2.t,use2.p)$statistic/sqrt(600)))}   \\ 
    July, 2012 & \Sexpr{prettyNum(mean(use3.t))} (\Sexpr{prettyNum(sd(use3.t))}) & \Sexpr{prettyNum(mean(use3.p))} (\Sexpr{prettyNum(sd(use3.p))}) & \Sexpr{prettyNum(abs(t.test(use3.t,use3.p)$statistic))} (\Sexpr{ztrunc(t.test(use3.t,use3.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use3.t,use3.p)$statistic/sqrt(600)))}   \\ 
    August, 2012 & \Sexpr{prettyNum(mean(use4.t))} (\Sexpr{prettyNum(sd(use4.t))}) & \Sexpr{prettyNum(mean(use4.p))} (\Sexpr{prettyNum(sd(use4.p))}) & \Sexpr{prettyNum(abs(t.test(use4.t,use4.p)$statistic))} (\Sexpr{ztrunc(t.test(use4.t,use4.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use4.t,use4.p)$statistic/sqrt(600)))}   \\ 
    September, 2012 & \Sexpr{prettyNum(mean(use5.t))} (\Sexpr{prettyNum(sd(use5.t))}) & \Sexpr{prettyNum(mean(use5.p))} (\Sexpr{prettyNum(sd(use5.p))}) & \Sexpr{prettyNum(abs(t.test(use5.t,use5.p)$statistic))} (\Sexpr{ztrunc(t.test(use5.t,use5.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use5.t,use5.p)$statistic/sqrt(600)))}   \\ 
    October, 2012 & \Sexpr{prettyNum(mean(use6.t))} (\Sexpr{prettyNum(sd(use6.t))}) & \Sexpr{prettyNum(mean(use6.p))} (\Sexpr{prettyNum(sd(use6.p))}) & \Sexpr{prettyNum(abs(t.test(use6.t,use6.p)$statistic))} (\Sexpr{ztrunc(t.test(use6.t,use6.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use6.t,use6.p)$statistic/sqrt(600)))}   \\ 
    November, 2012 & \Sexpr{prettyNum(mean(use7.t))} (\Sexpr{prettyNum(sd(use7.t))}) & \Sexpr{prettyNum(mean(use7.p))} (\Sexpr{prettyNum(sd(use7.p))}) & \Sexpr{prettyNum(abs(t.test(use7.t,use7.p)$statistic))} (\Sexpr{ztrunc(t.test(use7.t,use7.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use7.t,use7.p)$statistic/sqrt(600)))}   \\ 
    December, 2012 & \Sexpr{prettyNum(mean(use8.t))} (\Sexpr{prettyNum(sd(use8.t))}) & \Sexpr{prettyNum(mean(use8.p))} (\Sexpr{prettyNum(sd(use8.p))}) & \Sexpr{prettyNum(abs(t.test(use8.t,use8.p)$statistic))} (\Sexpr{ztrunc(t.test(use8.t,use8.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use8.t,use8.p)$statistic/sqrt(600)))}   \\ 
    January, 2013 & \Sexpr{prettyNum(mean(use9.t))} (\Sexpr{prettyNum(sd(use9.t))}) & \Sexpr{prettyNum(mean(use9.p))} (\Sexpr{prettyNum(sd(use9.p))}) & \Sexpr{prettyNum(abs(t.test(use9.t,use9.p)$statistic))} (\Sexpr{ztrunc(t.test(use9.t,use9.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use9.t,use9.p)$statistic/sqrt(600)))}   \\ 
    February, 2013 & \Sexpr{prettyNum(mean(use10.t))} (\Sexpr{prettyNum(sd(use10.t))}) & \Sexpr{prettyNum(mean(use10.p))} (\Sexpr{prettyNum(sd(use10.p))}) & \Sexpr{prettyNum(abs(t.test(use10.t,use10.p)$statistic))} (\Sexpr{ztrunc(t.test(use10.t,use10.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use10.t,use10.p)$statistic/sqrt(600)))}   \\ 
    March, 2013 & \Sexpr{prettyNum(mean(use11.t))} (\Sexpr{prettyNum(sd(use11.t))}) & \Sexpr{prettyNum(mean(use11.p))} (\Sexpr{prettyNum(sd(use11.p))}) & \Sexpr{prettyNum(abs(t.test(use11.t,use11.p)$statistic))} (\Sexpr{ztrunc(t.test(use11.t,use11.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use11.t,use11.p)$statistic/sqrt(600)))}   \\ 
    April, 2013 & \Sexpr{prettyNum(mean(use12.t))} (\Sexpr{prettyNum(sd(use12.t))}) & \Sexpr{prettyNum(mean(use12.p))} (\Sexpr{prettyNum(sd(use12.p))}) & \Sexpr{prettyNum(abs(t.test(use12.t,use12.p)$statistic))} (\Sexpr{ztrunc(t.test(use12.t,use12.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use12.t,use12.p)$statistic/sqrt(600)))}   \\ 
    May, 2013 & \Sexpr{prettyNum(mean(use13.t))} (\Sexpr{prettyNum(sd(use13.t))}) & \Sexpr{prettyNum(mean(use13.p))} (\Sexpr{prettyNum(sd(use13.p))}) & \Sexpr{prettyNum(abs(t.test(use13.t,use13.p)$statistic))} (\Sexpr{ztrunc(t.test(use13.t,use13.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use13.t,use13.p)$statistic/sqrt(600)))}   \\ 
    June, 2013 & \Sexpr{prettyNum(mean(use14.t))} (\Sexpr{prettyNum(sd(use14.t))}) & \Sexpr{prettyNum(mean(use14.p))} (\Sexpr{prettyNum(sd(use14.p))}) & \Sexpr{prettyNum(abs(t.test(use14.t,use14.p)$statistic))} (\Sexpr{ztrunc(t.test(use14.t,use14.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use14.t,use14.p)$statistic/sqrt(600)))}   \\ 
    July, 2013 & \Sexpr{prettyNum(mean(use15.t))} (\Sexpr{prettyNum(sd(use15.t))}) & \Sexpr{prettyNum(mean(use15.p))} (\Sexpr{prettyNum(sd(use15.p))}) & \Sexpr{prettyNum(abs(t.test(use15.t,use15.p)$statistic))} (\Sexpr{ztrunc(t.test(use15.t,use15.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use15.t,use15.p)$statistic/sqrt(600)))}   \\ 
    August, 2013 & \Sexpr{prettyNum(mean(use16.t))} (\Sexpr{prettyNum(sd(use16.t))}) & \Sexpr{prettyNum(mean(use16.p))} (\Sexpr{prettyNum(sd(use16.p))}) & \Sexpr{prettyNum(abs(t.test(use16.t,use16.p)$statistic))} (\Sexpr{ztrunc(t.test(use16.t,use16.p)$p.value)}) & \Sexpr{prettyNum(2*abs(t.test(use16.t,use16.p)$statistic/sqrt(600)))}   \\ \hline     
  \end{tabular}
\end{table}
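
The effect sizes in the last column of Table~\ref{tab:kwh} are obtained from the $t$ statistic.  As a sketch of why this conversion works (the symbols are introduced here only for illustration), with equal group sizes $n_T=n_P=n$ and total sample size $N=2n$, Cohen's $d$ based on the pooled standard deviation $s_p$ satisfies
\begin{equation*}
d = \frac{\lvert\bar{x}_T-\bar{x}_P\rvert}{s_p} = \lvert t\rvert\sqrt{\frac{1}{n_T}+\frac{1}{n_P}} = \lvert t\rvert\sqrt{\frac{2}{n}} = \frac{2\lvert t\rvert}{\sqrt{N}},
\end{equation*}
so that with $N=600$ each $t$ statistic is multiplied by $2/\sqrt{600}\approx 0.082$.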

A varying-intercept model was fit allowing each participant and month to have its own intercept \cite{gelman2010arm}.  Across the sixteen-month period, participants in the treatment IHD group used on average \Sexpr{prettyNum(abs(fixef(hlm)[2]))} fewer kWh each month than those in the placebo group, $t(598)=$ \Sexpr{prettyNum(abs(fixef(hlm)[2]/sqrt(diag(vcov(hlm))[2])))}, $p<$ \Sexpr{ztrunc(2*pt(-abs(fixef(hlm)[2]/sqrt(diag(vcov(hlm))[2])),598))}.
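
Written out, and using notation introduced here only for illustration, the varying-intercept model for the electricity use of participant $i$ in month $m$ is
\begin{equation*}
\mathrm{kWh}_{im} = \beta_0 + \beta_1 T_i + u_i + v_m + \varepsilon_{im}, \qquad u_i \sim N(0,\sigma_u^2),\quad v_m \sim N(0,\sigma_v^2),\quad \varepsilon_{im} \sim N(0,\sigma^2),
\end{equation*}
where $T_i$ is 1 for the treatment group and 0 for the placebo group, $u_i$ and $v_m$ are the participant and month intercepts, and $\beta_1$ is the average monthly difference in kWh between the two groups.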

\subsubsection{Ancillary Analyses}
A simple, one-minute questionnaire included items that had been tested for their ability to predict the decision to volunteer.  Participants were offered a \$2 bill \cite{doody2003randomized,booker2011systematic} for completing the questionnaire.  The questionnaire items were then used to develop a propensity score model to adjust for volunteer bias, and the resulting propensity scores were used to adjust the treatment effect model.\footnote{We omit the mediation and propensity score model here for brevity, but these would be reported in the actual paper.}

\subsubsection{Harms}
Participants were provided with a 1-800 number to register difficulties or complaints. A small number of participants in both the treatment and placebo groups reported increased monthly electricity bills due to the addition of the in-home display.  No other complaints related to participation in the study were registered.

\subsection{Other Information}
\subsubsection{Registration}
The trial was registered with the WHO International Clinical Trials Registry Platform (here).
\subsubsection{Protocol}
Prior to conducting the trial, the protocol (available here (link)) was peer reviewed and published in the journal \emph{Trials}.
\subsubsection{Funding and Responsibilities}
The project involved collaboration between the in-home display vendor, a utility company, and a university.  The experimental design, construction of the sampling frame, sampling procedures, randomization procedures, blinding procedures, statistical analyses, and creation of reports were the sole responsibility of the university, as was ensuring that the protocol was administered, amended, and revised appropriately.\footnote{\url{http://cancercenters.cancer.gov/}}  The in-home display content was the joint work of the university, utility, and vendor.  Communications with participants, including baseline, volunteer, psychodemographic, and follow-up surveys, were created jointly by the utility and the university.  The monitoring and evaluation of the trial were done jointly by the utility, vendor, and university.  The vendor covered the cost of developing and producing the in-home displays, the utility covered the logistical costs of the trial, including smart meters, phone centers, and mailing, and the university covered the costs of employing academic researchers.

\section{Costs of a Gold Standard RCT}
Financial constraints will always be a primary consideration in conducting a field trial. While public data on the costs of pilot studies are very limited, estimates put the average cost to the utility of an in-home display trial at approximately \$500 per household [insert ACEE citation]. Table~\ref{tab:costs} shows estimates of the increased costs for designing and implementing an in-home display trial that adheres to our gold standard.

\begin{table}[h]
\centering
\caption{Estimated additional costs for in-home display trial to meet the gold standard.  Planning costs assume skilled labor at \$50 per hour.  Fixed costs only need to be paid one time for the entire study.  Marginal costs are costs per participant.}
\label{tab:costs}
\begin{tabular}{c p{7cm}}
Study Design Category & Additional Cost \\ \hline 
Fixed Costs & \\ \hline
Systematic Review & $\sim$ 100-200 hours planning \\
Sequence Generation & No cost \\
Allocation Concealment & $\sim$ 10-20 hours planning \\
Blinding & $\sim$ 10-20 hours planning \\
Eligibility Exclusions & $\sim$ 30-50 hours planning \\
Volunteer Adjustment & $\sim$ 50-100 hours planning  \\
Withdrawal Prevention & $\sim$ 50-100 hours planning \\ 
Reporting & $\sim$ 10-20 hours labor \\ \hline
Sum & $\sim$ \$13,000-\$26,000 (260-510 hours) \\ \hline
& \\
Marginal Costs &  \\ \hline
Placebo Control Group & $\sim$ 2$\times$ as many participants \\
Volunteer Adjustment & $\sim$ \$5-\$10 per participant \\
Withdrawal Prevention & $\sim$ \$5-\$10 per participant \\ \hline
Sum & $\sim$ \$1,010-\$1,020 per treatment-control pair \\ \hline
\end{tabular}
\end{table}
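
The sums in Table~\ref{tab:costs} can be reconstructed as follows; this is a sketch of the arithmetic under the stated assumptions of \$50 per hour of planning and a base cost of roughly \$500 per household.  The fixed-cost hours add up to
\begin{align*}
100+10+10+30+50+50+10 &= 260 \text{ hours (low)}, \\
200+20+20+50+100+100+20 &= 510 \text{ hours (high)},
\end{align*}
which at \$50 per hour gives $260\times 50 = 13{,}000$ and $510\times 50 = 25{,}500$ dollars, the latter rounded in the table to \$26,000.  The marginal sum appears to correspond to two households per treatment-control pair plus the two per-participant items counted once, that is, $2\times 500+5+5=1{,}010$ to $2\times 500+10+10=1{,}020$ dollars.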

Additional costs are primarily equipment-based and do not take into account economies of scale (i.e., equipment may be discounted at higher bulk purchase levels). What we do not present, but what should also be considered, is the future cost of having to conduct additional trials as a result of poor evidence collection, or the even greater cost of rolling out a technology or behavioral intervention that a better initial pilot study would not have shown to be effective.

\section{Conclusion}
In this paper we have argued that electricity industry decision-makers deciding whether to implement new technologies that can affect consumer behavior face the same kind of decision that the FDA faces when regulating a new drug or device. Just as the FDA requires a high standard of evidence for approval, so too should electricity industry decision-makers require a high standard of evidence before investing in new technologies or behavioral interventions.  This standard of evidence is the randomized controlled effectiveness trial.  We detailed how these trials are conducted in biomedical research and showed how they can be translated to research on electricity consumption behavior, using in-home displays as an example.  Finally, we provide two checklists in~\ref{checklists} to help researchers implement these standards, as checklists have been shown to be very effective in implementing well-known procedures \cite{gawande2010checklist,pronovost2006intervention}.  We hope that those who conduct research on human behavior in electricity learn and use these standards, and that before investments are made, policy-makers demand the high level of evidence that their customers, constituents, the public, and the scientific community deserve.

\clearpage
\appendix
\section{Checklists}
\label{checklists}
\begin{table}[h]
\caption{Background, Internal Validity, and External Validity}
\label{tab:check1}
\begin{tabular}{p{12cm} |c| |c|}
Item & Yes & No \\ \hline
\underline{Background} & & \\
Has a systematic review been completed? & & \\
Has a meta-analysis been completed? & & \\
Has the methodological quality of prior evidence been accounted for? & & \\
Have objectives been clearly stated as hypotheses? & & \\
Have plausible mechanisms been identified and facilitated by design? & & \\
& & \\
\underline{Internal Validity} & & \\
Has a concurrent control group been used? & & \\
Is the control group identical to the treatment group? & & \\
Is the placebo indistinguishable from the treatment? & & \\
Has equipoise been established? & & \\
Is randomization used? & & \\
Is the randomizing sequence truly random? & & \\
Is random allocation adequately concealed? & & \\
Are participants blinded to their group? & & \\
Are personnel, data collectors, and data analysts blinded? & & \\
Are contacts and visits balanced to maintain blinding? & & \\
& & \\
\underline{External Validity} & & \\
Is the sample drawn randomly from the population? & & \\
Are the eligibility criteria clear, pre-specified, and minimal? & & \\
Are justifications for exclusion criteria provided? & & \\
Is a propensity score model used to account for volunteering? & & \\
Do exclusion criteria minimize chances of withdrawal? & & \\
Are measurements likely to be completed before withdrawal? & & \\ \hline
\end{tabular}
\end{table}

\begin{table}[h]
\caption{Statistical Validity and Reporting}
\label{tab:check2}
\begin{tabular}{p{12cm} |c| |c|}
Item & Yes & No \\ \hline
\underline{Statistical Validity} & & \\
Are data analysts blinded? & & \\
Are statistical analyses preplanned and published prior to study commencement? & & \\
Are pre-planned and post-hoc analyses clearly identified? & & \\
Has a power analysis been conducted? & & \\
Has a sample size calculation been conducted? & & \\
Are the rules for early stopping set forth ahead of time? & & \\
Has a whole-sample analysis been done? & & \\
Has intention-to-treat analysis been used? & & \\
Has imputation been used? & & \\ 
& & \\
\underline{Reporting} & & \\
Does the trial follow the CONSORT statement? & & \\
Are data made publicly available? & & \\
Are statistical analyses publicly available and reproducible? & & \\
Are study materials publicly available and reproducible? & & \\ \hline
\end{tabular}
\end{table}
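
The sample size item in Table~\ref{tab:check2} can be made concrete with a standard textbook approximation.  The sketch below assumes a two-group comparison of means with equal group sizes, a common standard deviation $\sigma$, and a normal approximation; the symbols and the numerical inputs are illustrative assumptions, not values from a specific trial.
\begin{equation*}
n_{\text{per group}} \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}},
\end{equation*}
where $\delta$ is the smallest difference in mean consumption worth detecting.  For a two-sided $\alpha = 0.05$ ($z_{1-\alpha/2} \approx 1.96$), power of $0.80$ ($z_{1-\beta} \approx 0.84$), and $\delta = 0.5\sigma$,
\begin{equation*}
n_{\text{per group}} \approx \frac{2\,(1.96 + 0.84)^{2}}{0.5^{2}} = \frac{2 \times 7.84}{0.25} \approx 62.7,
\end{equation*}
or about 63 households per group before any allowance for withdrawal.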

\clearpage
\section{Acknowledgement}
This work was supported by the Center for Climate and Energy Decision Making (SES-0949710), through a cooperative agreement between the National Science Foundation and Carnegie Mellon University.

This material is based upon work supported by the Department of Energy under Award Numbers DE-OE0000300 and DE-OE0000204.  Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

\clearpage
\renewcommand{\refname}{\CHead{Endnotes:}}
\bibliographystyle{unsrt}
\bibliography{/home/alex/Dropbox/masterbib}
\end{document}

Nothwehr (2006): This study examines whether changes in goal setting frequency predict changes in use of behavioral strategies over time, controlling for baseline strategy use, demographics, and whether a person was trying to lose weight. What will happen if we have participants set reduction goals more frequently than just once a month?  Would people adjust their goals relative to their confidence in attaining the set goal? Would it lead people to reduce less than they actually can because they are undercutting their reduction goals for fear that they won't attain them?  What happens when the reduction goals they set are minuscule (i.e., reduce by 0.5\% over the course of a day/month)? Would they actually hit them because they're attainable? Or would those reduction goals become so small that they don't want to put in the effort to reduce?  Goal setting frequency was found to be strongly and positively associated with use of the strategies measured, both at baseline and over time.  Goal setting specifically related to diet or physical activity was, in most cases, more strongly associated with the corresponding strategies than goal setting related to body weight.  Self-monitoring appears to be quite strongly associated with goal setting frequency.  Results suggest that setting more specific goals for diet or physical activity is generally more strongly associated with strategy use than setting weight-related goals.

Atkinson (1958): task difficulty, measured as a probability of task success, was related to performance in a curvilinear, inverse function. The highest level of effort occurred when the task was moderately difficult, and the lowest levels occurred when the task was either very easy or very hard. (What kind of tasks were people asked to perform?) Goal difficulty effect sizes (d) in meta-analyses ranged from 0.52 to 0.82 (Locke \& Latham, 1990). Performance leveled off or decreased only when the limits of ability were reached or when commitment to a highly difficult goal lapsed (Erez \& Zidon, 1984). Goal specificity reduces variation in performance.  How is goal difficulty measured for energy saving tasks? Do people who sign up for energy conservation programs imagine difficulty level to be easy, but when put in a field setting consider goals to be difficult? Does difficulty level of these goals increase or decrease in time? 

Social-cognitive theory: self-efficacy (task-specific confidence) is measured by getting efficacy ratings across a whole range of possible performance outcomes rather than from a single outcome. When goals are self-set, people with high self-efficacy set higher goals than do people with lower self-efficacy. They are also more committed to assigned goals, find and use better task strategies to attain the goals, and respond more positively to negative feedback than do people with low self-efficacy (Locke & Latham, 1990; Seijts & B. W. Latham, 2011). 

When people are trained in proper strategies, those given specific high-performance goals are more likely to use those strategies than people given other types of goals; hence, their performance improves (Earley & Perry, 1987).

Goal-setting theory appears to contradict Vroom's (1964) valence-instrumentality-expectancy theory. Vroom's theory states that the force to act is a multiplicative combination of valence (anticipated satisfaction), instrumentality (the belief that performance will lead to rewards), and expectancy (the belief that effort will lead to the performance needed to attain the rewards).
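
Written as a formula, the multiplicative combination can be sketched as follows, where the symbols $F$ (force to act), $V$ (valence), $I$ (instrumentality), and $E$ (expectancy) are labels introduced here for illustration:
\begin{equation*}
F = V \times I \times E .
\end{equation*}
Because the terms multiply, the force to act is zero whenever any one of them is zero.  For example, with the hypothetical values $V = 0.8$, $I = 0.5$, and $E = 0$, the result is $F = 0$ no matter how attractive the reward.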

Directive function: goals direct attention and effort toward goal-relevant activities and away from goal-irrelevant activities. Rothkopf and Billington (1979) found students with specific learning goals paid attention  to and learned goal-relevant prose passages better than goal-irrelevant passages. 

Locke and Bryan (1969) found that people who were given feedback about multiple aspects of their performance on an automobile-driving task improved their performance on the dimensions for which they had goals but not on other dimensions.

Energizing function: high goals lead to greater effort than low goals. This has been shown with tasks that directly entail physical effort, with tasks that entail repeated performance of simple cognitive tasks such as addition, and with measurements of subjective effort and physiological indicators of effort.  

Goals affect persistence: when participants are allowed to control the time they spend on a task, hard goals prolong effort (LaPorte \& Nath, 1976). There is often a trade-off in work between time and intensity of effort. Faced with a difficult goal, it is possible to work faster and more intensely for a short period or to work more slowly and less intensely for a long period. Tight deadlines lead to a more rapid work pace than loose deadlines in the laboratory (Bryan \& Locke, 1967b) as well as in the field (Latham \& Locke, 1975). 

Goals affect action indirectly by leading to the arousal, discovery, and/or use of task-relevant knowledge and strategies.  When confronted with task goals, people automatically use the knowledge and skills they have already acquired that are relevant to goal attainment.  If the path to the goal is not a matter of using automatized skills, people draw from a repertoire of skills that they have used previously in related contexts, and they apply them to the present situation. 
If the task for which a goal is assigned is new to people, they will engage in deliberate planning to develop strategies that will enable them to attain their goals (Smith, Locke, & Barry, 1990).

People with high self-efficacy are more likely than those with low self-efficacy to develop effective task strategies (Latham, Winters, & Locke, 1994; Wood & Bandura, 1989). There may be a time lag between assignment of the goal and the effects of the goal on performance, as people search for appropriate strategies (Smith et al., 1990). 

When people are confronted with a task that is complex for them, urging them to do their best sometimes leads to better strategies (Earley, Connolly, \& Ekegren, 1989) than setting a specific difficult performance goal. An alternative is to set specific challenging learning goals, such as to discover a certain number of different strategies to master the task (Seijts \& G. P. Latham, 2001; Winters \& Latham, 1996). Does this apply to learning goals?

Covington, 2000

However, habits can result in a rut where simple, easier solutions are not discovered or even sought out.  Thus, changing habits is important but difficult (Gifford, 2011).  Habits may need to be disrupted and replaced for energy savings to be possible \cite{fischer2008feedback}.  Habit strength and automatic, procedural goals can be assessed using response-frequency measures \cite{aarts2000habits,steg2009encouraging}.  

\subsection{Self-Efficacy}

Self-efficacy theory proposes that meta-cognitive evaluations and self-regulation matter.  Getting things done increases our perception that we are able to affect and control the environment to realize our intentions.  This gives people ``self-satisfaction and a sense of pride and self-worth'' and as a consequence they ``refrain from behaving in ways that give rise to self-dissatisfaction, self-devaluation, and self-censure'' \cite{bandura2001social}.  A well-designed display that is effective and easy to use can increase motivation by increasing feelings of self-efficacy and competence \cite{thogersen2010electricity,bandura2001social}.  Devices that reduce the gulf between intentions and allowable actions, and that make it easy to understand the current state of consumption, are likely to be very effective by enhancing self-efficacy \cite{norman2002design}.

Learned helplessness, on the other hand, is a pervasive feeling that one cannot control one's environment (Abramson, Seligman, and Teasdale, 1978; Lazarus and Folkman, 1984; Weiner, 1985; Dweck and Leggett, 1988; Deci and Ryan, 2000).  A poorly designed device will make people feel that they are helpless in understanding and manipulating their electricity use.  As a result people will stop interacting with the device, because ``if you fail at something, you think it is your fault.  Therefore you think you can't do that task.  As a result, next time you have to do the task, you believe you can't so you don't even try.  The result is that you can't, just as you thought. You're trapped in a self-fulfilling prophecy'' \cite{norman2002design}.  A complex, confusing, and uninformative display design is worse than no display at all, as bad design reinforces learned helplessness and undermines self-efficacy.  If saving energy is difficult or full of obstacles, or if information seems hard to comprehend, people with high self-efficacy are likely to persevere and succeed, resulting in energy savings, whereas those who believe they cannot do this, or who subscribe to a fatalist perspective (Gifford, 2011), are unlikely to see benefits from the IHD.  If one feels it is too difficult to reduce wasted electricity through intentional action, then electricity may be ``squandered'' \cite{thogersen2010electricity}.  

\subsection{Goals}

There is very strong evidence on the effectiveness of goals.  Studies involving a large number of participants, using a variety of tasks, conducted in many different countries show that setting specific and difficult goals results in better performance than instructing participants to do their best, irrespective of whether the goals are self-set or set by an external source (Wood and Newborough, 2007).  Goals can be set as hedonic goal-frames (e.g., enjoying leisure), gain goal-frames (e.g., saving money), and normative goal-frames (e.g., doing what the neighbors would approve of) (Steg and Vlek, 2009).  Goals should be neither too small nor too large, should be customized to the potential savings of the household, and should be accompanied by advice or tips for meeting the goal and by feedback on whether the household is meeting it (Karjalainen, 2011).  The source of the goal and the timeframe used to achieve the goal may matter (Consolvo, 2009).  When failing to meet a goal, people adjust their goal downward toward their performance, and this decrement is larger than the increase in goals that follows meeting a goal (Illies, 2005).  The difficulty of self-set goals and commitment to achieving them are related to conscientiousness (Klein, 2006).  Goal setting is also likely related to perceptions about whether one needs to meet a goal or not; that is, whether one sees one's behavior as wasteful or inefficient.  If goals are challenging and no goal progress is evident, then participants may become discouraged and withdraw from task effort; commitment in the face of failure to meet one's goal is related to self-efficacy (West, 2005).  Thus, for people low in self-efficacy it might be necessary to give goal feedback only when they are performing well, to keep them motivated.  Feedback that one is approaching a specific goal successfully can motivate attaining that goal.  
Locke (2002) argues that the more difficult the goal, the greater the effort toward that goal, and that specific goals do not improve performance but reduce variability in performance.  Expectations about whether one can achieve one's goals matter for performance: expecting that one cannot achieve a goal harms performance, but the difficulty of the goal matters more.  Self-efficacy is likely to matter more for goals that are self-set than for goals that are set by another person.  Manageable short-term goals are more effective than the same goals aggregated into a single long-term goal (McMillian and Sparkes, XXXX?).  That hard goals result in higher performance than ``do your best'' or vague goals is related to the ambiguity inherent in vague goals (Locke and Latham, 1990). This ambiguity allows individuals to justify to themselves that they have tried hard enough at a point that falls lower than the performance level of someone who is trying for a specific and challenging goal.


\subsection{Mapping Behaviors to Consumption}

When customers learn what actions they can take, and how each action affects their electricity use, they map behaviors to consumption.  Rather than simply increasing awareness by ``drawing attention to the cost of energy'' \cite{mcclelland1979energy}, the IHD ``teach[es] residents what activities consume the most energy'' .  People interacting with a device initially did not know what used the most electricity, ``P1 --- realized the A/C uses a lot more power than he initially suspected; P3 was surprised that the TV didn't ---have a whole lot of effect; and, P5 did not anticipate that the clothes dryer would have such significant impact on consumption''\cite{yun2009investigating}.

There is conflicting evidence, however.  People have difficulty figuring out what needs to be done, even if they know what uses the most energy.  That is, they ``do not quite know what they can do to reduce their electricity consumption'' \cite{eiden2009investigation}.  If this is the case, then energy-saving tips would have to accompany the IHD for it to be effective.

\subsection{Energy Efficient Appliance Purchases}

In-home displays may work indirectly by encouraging energy efficient appliance purchases, but otherwise have no effect on curtailment behaviors.  By learning about total consumption, cost, and appliance-specific use, households may decide that the best course of action is to purchase an energy efficient appliance.  This leads to a ``one-time decision to initiate the retrofit'' \cite{seligman1978behavioral}.  \citet{dobson1992conservation} found that users of the RECS knew more about which appliances consume energy and that this caused them to purchase more efficient appliances.  \citet{yun2009investigating} found that two households replaced high-power consuming devices with more energy efficient devices.

\subsection{Control State Maintenance}

Because electricity is consumed implicitly in a variety of contexts and modes, there is no ``concise cognitive frame'' of electricity consumption that makes sense and is usable to people \cite{fischer2008feedback}.  In-home displays can help give context to the user's electricity consumption. The IHD helps consumers maintain an optimal consumption state and avoid waste \cite{seligman1978behavioral}.  Control states focus people's attention on specific actions and on exactly when they are appropriate.  The IHD may simplify energy consumption tasks by telling customers exactly what to do and when.  \citet{yun2009investigating} observed of his participants: ``P2 independently made the decision to keep his ECD between 2 and 3 lights.  When the display was higher, he reported setting out to investigate, turning off devices along the way.  P5 reduced her household's average daily consumption by more than 50\% by pursuing the goal of never having the ECD blink.''  It may make a difference whether the control states or goals are self-imposed or encouraged externally \cite{yun2009investigating}.

Carver and Scheier, 1981 action theory (e.g., Frese & Zapf, 1994),

Frese, M., & Zapf, D. (1994). Action as the core of work psychology: A German approach. In H. C. Triandis, M. D. Dunnette, & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., Vol. 4, pp. 271-340). Palo Alto, CA: Consulting Psychologists Press.
action identification theory (Vallacher & Wegner, 1987)
Vallacher, R. R., & Wegner, D. M. (1987). What do people think they're doing? Action identification and human behavior. Psychological Review, 94, 3-15.
variant of learned helplessness theory (e.g., Mikulincer, 1994),

\subsection{Encouraging Sustainable Behaviors}

The in-home display may encourage sustainable behaviors by helping people engage in behaviors that they would like to commit to but have thus far failed to implement (\cite{yun2009investigating}; Chetty et al., 2008; getting to green; McCalley, L. T. From motivation and cognition theories to everyday applications and back again: the case of product-integrated information and feedback. Energy Policy 34, no. 2 (January 2006): 129--137).  \citet{paetz2011shifting} found that although habits did not change, attitudes toward electricity use did, making people feel guilty about wasting electricity: ``This load curve has changed my attitudes. At least I know how much power the coffee machine needs. This morning for example I turned it off, because I knew my roommate was still asleep and I thought it doesn't have to run for another hour without need. (T1 interview)'' \cite{paetz2011shifting}.

\subsection{Awareness/Consciousness}




\subsection{Play/Games}

Interacting with the IHD and saving energy can be fun.  Some households naturally create energy saving games, such as trying to find ``how low can you go'' in electricity consumption \cite{yun2009investigating}.  Another example of game play comes from participants living in a smart energy home.  They blogged their daily activities in the smart home, engaging in play behavior: ``For half an hour I have turned on as many appliances as possible, even my hair curler. I was impressed by 7000 Watts and no shortage ;-), but shocked that the hoover needed 4000 Watts power. It's like a game.'' (T2 blog) \cite{paetz2011shifting}.  Thus, in-home displays may be effective to the extent that they promote an environment for fun or ability to play with one's electricity through manipulation and feedback.  The device can also include pre-programmed games, as the ECI did \cite{hutton1986effects}.  

Garris and Ahlers, 2002

\subsection{Net-Benefit Calculations}

The perceived costs and benefits of curtailing electricity use will likely determine whether those behaviors are undertaken \cite{mckenzie2011fostering}. People intuitively perform net-benefit calculations for behaviors.  When the effort and financial costs required to undertake energy conservation actions outweigh the perceived benefits, people won't engage in the behavior.  For example, in interviews, residential customers could quickly identify and provide a ``comprehensive list of energy saving measures (e.g., cavity wall insulation) and behavioural changes (e.g., turning your thermostat down by 1 degree C)'', but did not act on them because the net benefit was perceived to be negative \cite{roberts2004consumer}.  In-home displays can correct inaccurate perceptions of net cost and benefit, thus promoting conservation behaviors \cite{steg2009encouraging}.

\subsection{Measurement}
The measurement instruments and measurement criteria should be precisely specified for each endpoint and ancillary measure.  Additional recurrent measurements should be specified.  For most studies there would need to be a baseline measurement, before treatment begins, and follow-up measurements for some specified duration after treatment ends.  The validity (e.g., construct, content, predictive) and reliability of each measure should be discussed and ensured.

Surrogate endpoints \cite{fleming1996surrogate}: surrogates can show benefit even as mortality increases \cite{barter2007effects}.  Likewise, it is possible to increase electricity knowledge while also increasing consumption \cite{fleming1994surrogate,prentice2006surrogate}.

Smaller, higher-powered studies can be conducted with surrogate endpoints that are assessed more frequently \cite{prentice2006surrogate}.  Such endpoints need to be justified based on their ability to predict the primary endpoint, on theoretical grounds, or on statistical grounds.  More generally, a valid surrogate is sensitive to the treatment intervention if and only if the ``true'' endpoint is also sensitive (need to reread the Prentice paper).  This is also known as a proxy variable (Wooldridge?).
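
One way to write the ``if and only if'' condition, in notation introduced here for illustration ($Z$ for treatment assignment, $S$ for the surrogate endpoint, $T$ for the true endpoint, and $f$ for a probability distribution), is as an equivalence of the two null hypotheses of no treatment effect:
\begin{equation*}
f(S \mid Z) = f(S) \iff f(T \mid Z) = f(T).
\end{equation*}
A condition often paired with this equivalence in the surrogate endpoint literature is that the surrogate captures the full effect of treatment on the true endpoint, $f(T \mid S, Z) = f(T \mid S)$.  Whether this holds for a given electricity measure, such as knowledge as a surrogate for consumption, is an empirical question and is not assumed here.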

Psychometrics

Baseline data must be collected before the intervention begins, and can include questionnaires, interviews, and other tests.  The groups should then be checked for comparability at baseline.

\section{Implementation}
It is not enough to have a conceptually valid experiment according to internal, external, and statistical validity.  The experiment must be implemented as intended for this conceptual validity to hold weight. 

\subsection{Protocol Development}
The NCI provides templates for protocol development that include many of the important issues discussed so far.\footnote{\url{http://ctep.cancer.gov/protocolDevelopment/docs/Generic_Protocol_Template_for_Cancer_Treatment_Trial.docx}}\footnote{\url{http://ctep.cancer.gov/protocolDevelopment/templates_applications.htm}}  

A protocol is most powerful if, once developed, it is submitted to peer review and published \emph{before} data collection begins.  Several journals\footnote{e.g., \url{http://www.trialsjournal.com/}} publish study protocols.  Doing so provides the opportunity to get valuable feedback on the research design from experts and makes the aims and design of the research public before the results are known, which clearly separates pre-planned from post-hoc design changes and statistical analyses.

\subsection{Treatment Plan}
The treatment plan specifies what the intervention is composed of, the schedule for its implementation, maintenance, and measurements, and the duration of the intervention (including discontinuation due to harm or futility).  Careful attention to the treatment plan can make sure that those who receive the treatment and those in the control group are treated in exactly the same way.

\subsection{Administration of Treatment}
A specific procedure needs to be created that states how blinding will be maintained and protected, the circumstances under which the blind may be broken, who is authorized to break it, and the procedure for doing so.

An interesting element is a placebo run-in period, where all participants are given a placebo for a fixed duration before the participants assigned to treatment are switched to the treatment.  This allows assessment of both placebo effects and compliance.

\subsection{Randomization and Packaging}
The most effective way to randomize is through a central phone center \cite{beta1982randomized}.
The intervention assignment should begin as soon as possible after consent.  The recommendation is blocked randomization with stratification by location, but not by demographics or other factors.
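
As an illustration of blocked randomization, assume two arms (treatment $T$ and control $C$) and a block size of four; both are choices made here for the example rather than values from a specific protocol.  Each block contains two of each assignment, so the number of distinct blocks is
\begin{equation*}
\binom{4}{2} = \frac{4!}{2!\,2!} = 6,
\end{equation*}
namely $TTCC$, $TCTC$, $TCCT$, $CTTC$, $CTCT$, and $CCTT$.  Drawing one of these six blocks at random for each successive group of four participants within a location keeps the arms balanced: after every complete block the two arms in that location are equal in size, and the imbalance at any point never exceeds two participants.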

``Give specific details on how a participant will be registered in a trial. For randomized trials, describe the procedure for randomizing a participant to a dose group. (May refer to Section 13.3).''

``Protocols using non-DCP supplied agents: describe in detail how the agent will be packaged and distributed, including container, amount of agent per container, container label information, and if blinded, how the label will be constructed to maintain the blind. Label information should include dose, number of doses per day, time of day for dosing, with or without food, and any other specific instructions.''

\subsection{Study Calendar}
The study calendar is a matrix, with study events (e.g., physical exam) as rows and the timing of the beginning of the event (e.g., week 3) as columns.  Specific events may need to be completed within some specific time period relative to other events, for example baseline information must be collected before the treatment has been implemented for one week.

\subsection{Adverse Events List and Reporting}
A list of actual and potential adverse events, whether previously observed, theoretically plausible, or otherwise expected, needs to be created and documented.\footnote{\url{http://ctep.cancer.gov/protocolDevelopment/electronic_applications/adeers.htm}}  A comprehensive list of adverse events and potential risks (CAEPR) is constructed for each intervention.  Adverse events are undesirable circumstances.  These must be recorded and then judged as to whether they are attributable to the intervention, with varying degrees of certainty (possible, probable, definite), and as to the severity of the event (e.g., none, mild, moderate, severe, etc.).  Only mild or moderate adverse events that are either unrelated or unlikely to be related to the treatment do not need to be reported; all other likelihoods and severities need to be reported.

\begin{table}[h]
\caption{Recruitment and Installation Protocol}
\begin{tabular}{c p{11cm} c}
Time &  Action & Group \\ \hline
0 & Mail Recruitment Docs & CMU \\
0 & Mail Volunteer Survey, schedule phone call & CMU \\
Day 7 & Call non-responders (how to get phone \# ?) & CMU \\
Day 7 & Postcard to non-responders & CMU \\
Day 14 & Call non-responders & CMU \\
Day 14 & Randomization & CMU \\
Day 14 & Request certificate from SilverSprings for treatment group & Pepco \\
Day 14 & Mail 200 Frames w/ Welcome Package & Ceiva \\
Day 21? & {\bf Call Part 1} & CMU \\ \hline
& CMU will read a short informed consent document and get verbal consent from participant.  CMU will then pass the call to Deepa for instructions for pairing the frame to AMI. &  \\
Day 21 & {\bf Call Part 2} & Pepco \\ \hline
& Deepa will instruct the participant to pair the frame to the meter (what will these be? can we get a protocol?).  Call gets passed to Ceiva. & \\
Day 21 & {\bf Call Part 3} & Ceiva \\ \hline
& Ceiva will help participants set up the frame with photos and other stuff (what will this be?; can we get a copy of their protocol?). & \\
Day 21 & {\bf Call Part 4} & CMU \\ \hline
& CMU will have participants complete the baseline questionnaire.  After this the call will end. &  \\
Day X & {\bf Customer support} & All \\ \hline
& We need a way of triaging problems to CMU (problems with the trial), Pepco (problem with the meter/AMI) or Ceiva (problem with the frame).  CMU will probably take the call initially, try to diagnose whether it should go to CMU/Pepco/Ceiva (how should we do this? What questions should we ask?) &  \\ \hline
\end{tabular}
\end{table}

  The timeline, payments, and recruitment for these measures are in Table \ref{tab:survey}.  

\begin{table}[h]
\caption{Table of information about the four surveys and three treatment groups.}
\label{tab:survey}
\begin{tabular}{c c c c}
Item & Payment & Max Duration & Sample Size \\ \hline
Volunteer & \$2 & 5 minutes & 3200 \\
Baseline & \$5 & 10 minutes & 500 \\
Psycho-Demographics & \$10 & 20 minutes & 500 \\
Close-out & \$5 & 10 minutes & 500 \\ 
Placebo & \$25 Gift Certificate & 1 Year & 150 \\ 
Treatment & \$25 Gift Certificate & 1 Year & 150 \\ 
Control & NA & 1 Year & 150 \\ \hline
\end{tabular}
\end{table}
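
As a rough upper bound on participant payments, the rows of Table~\ref{tab:survey} can be multiplied out.  The sketch assumes that every person counted in the Sample Size column completes the item and is paid in full, which is an assumption made here for illustration; actual response rates would lower the total.
\begin{align*}
\text{Surveys} &= 3200 \times \$2 + 500 \times \$5 + 500 \times \$10 + 500 \times \$5 = \$16{,}400, \\
\text{Gift certificates} &= (150 + 150) \times \$25 = \$7{,}500, \\
\text{Total} &= \$16{,}400 + \$7{,}500 = \$23{,}900.
\end{align*}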

%Examples: \cite{ivy2010approaches} \cite{seymour2010design}\footnote{\url{http://ctep.cancer.gov/}}

\section{Implementation}
RCTs need to be both conceptually valid in their design and implemented in a way that is faithful to that design.  If the plan for implementation (the standard operating procedures) is clearly set forth ahead of time and the study is monitored for adherence to these procedures, then one can argue in a published report that the actual study, not just the design, is valid.  The standard operating procedures are detailed in the \emph{Recruitment, Retention, and Adherence Plan}, overseen using the \emph{Data Safety and Monitoring Plan}, and communicated in the \emph{Reporting Plan}. 

\subsection{Recruitment, Retention, and Adherence Plan}
The Recruitment, Retention, and Adherence Plan (RRAP)\footnote{\url{http://dcp.cancer.gov/files/clinical-trials/rrap.doc}} describes what will be done to recruit participants before the treatment is administered, how participants will be retained in the study to prevent withdrawal, and how they will be encouraged to adhere to the study protocol.  The plan covers three stages: what will be done before recruitment is initiated, such as choosing the recruitment strategies; what will be done during recruitment to monitor and evaluate progress; and what will be done after recruitment to maintain and encourage participation \cite{dodge2008randomized}.

\subsubsection{Adherence}
Participants may enroll in the trial but not do what is required of them, such as failing to take medication in a drug trial, even when it could cost them their lives, or not counting their calories in a weight loss program.  Adherence programs try to make sure that participants do what they are supposed to do in the trial \cite{claxton2001systematic}.

A systematic review of interventions to improve medication adherence \cite{viswanathan2012interventions} identified strong evidence for case management, behavioral and educational interventions \cite{friedman1996telecommunications,johnson2006efficacy,ogedegbe2012randomized,lin2006effects,friedman2010fundamentals,pathman1996awareness,osterberg2005adherence,peterson2003meta,haynes2008interventions}, and reminders \cite{fulmer1999intervention} as effective adherence approaches.\footnote{\url{http://www.who.int/mip/2003/progress/en/}, \url{http://effectivehealthcare.ahrq.gov/}}  These approaches are tailored to the specific problems participants are expected to have in the study \cite{cooper2003acceptability,leutz1999five}, including cultural, racial, and social issues \cite{gilliland1998recommendations,fisher2007cultural}.

One example of an effective case management approach comes from a trial promoting medication adherence among people with comorbid hypertension or type 2 diabetes and depression \cite{bogner2012integrated,rudd2004nurse,bogner2008integration}.  Over a four-week period an integrated care manager met with African American participants with depression and type 2 diabetes three times in person for thirty minutes each and two times over the phone for fifteen minutes each \cite{bogner2010integrating}.\footnote{\url{http://www.annfammed.org/content/10/1/15/suppl/DC1}}  During these meetings participants were educated about the two diseases and the importance of controlling depression to help type 2 diabetes, were told about the rationale of the study and intervention, and were monitored for their progress and assisted with problems in the study.  The integrated care manager was trained to build rapport and be culturally competent \cite{beck2002physician,ferguson2002culture,schim2007culturally,kalb2006competency,rubin2011qualitative,lillie1995conducting}.  At a cost of only two hours of contact per participant over the four-week period, adherence greatly increased among those who were offered integrated case management compared to those receiving usual care. 
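
The two-hour figure follows from the contact schedule described above, counting only scheduled meeting time and assuming no preparation or travel time is included:
\[
3 \times 30~\text{min} + 2 \times 15~\text{min} = 90~\text{min} + 30~\text{min} = 120~\text{min} = 2~\text{hours}.
\]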

The problem with aggressive adherence programs is that they are unlikely to mimic the actual context of use outside of the trial.  Most patients will not have an integrated care program offered to them by their insurance provider, meaning that any benefit observed in the trial will not be realized in the real world.  Thus, pragmatic trials limit the use of adherence plans, and should not use them if they are not expected to be implemented in the ``real world.''

% Another study had a deleterious effect on adherence, that used a one hour in-person visit and two half-hour appointments monthly (one telephone and one in person) for 0-12 weeks, followed by monthly phone calls \cite{lin2006effects}.  Care managers used behavioral activations including exercise, goal-setting, and problem solving.
%Reducing out-of pocket costs 163. Chernew ME, Shah MR, Wegh A, Rosenberg SN, Juster IA, Rosen AB, et al. Impact of decreasing copayments on medication adherence within a disease management environment. Health Aff (Millwood). 2008;27:103-12. [PMID: 18180484] 164. Choudhry NK, Fischer MA, Avorn J, Schneeweiss S, Solomon DH, Berman C, et al. At Pitney Bowes, value-based insurance design cut copayments and increased drug adherence. Health Aff (Millwood). 2010;29:1995-2001. [PMID: 21041738] 165. Maciejewski ML, Farley JF, Parker J, Wansink D. Copayment reductions generate greater medication adherence in targeted patients. Health Aff (Mill- wood). 2010;29:2002-8. [PMID: 21041739] 166. Choudhry NK, Avorn J, Glynn RJ, Antman EM, Schneeweiss S, Toscano M, et al; Post-Myocardial Infarction Free Rx Event and Economic Evaluation (MI FREEE) Trial. Full coverage for preventive medications after myocardial infarction. N Engl J Med. 2011;365:2088-97. [PMID: 22080794]
% \cite{peterson2003meta} \cite{osterberg2005adherence}.   \cite{schillinger2003closing} \cite{probstfield1986successful} \cite{peto1977design} \cite{peto1976design} \cite{unutzer2002collaborative}

\subsection{Data and Safety Monitoring Plan}
A trial can fail at a number of critical points, including its design, the implementation of the protocol, the recording of data, and the statistical analysis \cite{baigent2008ensuring}.  The data and safety monitoring plan is a procedure for protecting participant confidentiality and ensuring that data are accurate and complete by catching and fixing each type of error.

Different monitoring bodies guard against each potential flaw.  Peer review of the protocol can prevent design flaws, training staff can prevent procedural and recording errors, and using a central statistical monitor can avoid errors in statistical analysis or misconduct.\footnote{\url{http://dcp.cancer.gov/files/clinical-trials/Consortia_DSMP.doc}, \url{http://www.cancer.gov/clinicaltrials/learningabout/patientsafety/dsm-guidelines},\url{http://ctep.cancer.gov/protocolDevelopment/default.htm#cde_data_pol_cdus},\url{http://grants.nih.gov/grants/policy/hs/data_safety.htm}}

A key element of the data management plan is the \emph{case report form}, which is a participant-specific hard-copy document that includes each participant's unique identifier, treatment assignment code, verbatim descriptions of adverse events, and other necessary information.  The case report form is used to validate records entered into a database, along with data codes, types, data ranges, and missing values.\footnote{\url{https://www.dcp-consortium-cdt.org/default.asp}}  Modern web-based data entry systems can provide version control and assist in error detection \cite{winget2005web}, reducing time to data publication by as much as 33\% \cite{litchfield2005future}.

Quality assurance procedures evaluate whether the study follows its standard operating procedures and other good clinical practice guidelines \cite{schuyl1999review,marks2001paradigm}.  Protocol changes, if they are needed, are reviewed to see if they were documented and made appropriately.\footnote{\url{http://ctep.cancer.gov/protocolDevelopment/default.htm},\url{http://ctep.cancer.gov/protocolDevelopment/policies_deviations.htm},\url{http://ctep.cancer.gov/branches/ctmb/clinicalTrials/docs/2006_ctmb_guidelines.pdf}}  As data are collected they are monitored and checked for errors using audits of information on adherence, baseline data, the primary response variables, and adverse events \cite{friedman2010fundamentals}.  Before the study is closed out all data and records are checked for accuracy, completeness, and cleanliness.
%An example of trial protocol changes \cite{greene1992cardiac}.
%Audits can be conducted \cite{soran2006centralized} \cite{weiss1998systems} \cite{peppercorn2008dilemma}.   
%Avoiding unnecessary complexities, such as complicated case report forms \cite{williams2006other}, and pretesting them to make sure researchers and participants can use them, can reduce error.  
%\cite{kahn1975randomized} WHO. Workshop on semantic interoperability prerequisites for efficient e-health systems. Information Society, Available from: <http://www.who.int/classifications/terminology/prerequisites.pdf/>, 2005. Harmonized semantics \cite{weng2007user} A coefficient of reliability can be calculated \cite{lachin2004role}.  

%To maintain blinding and ensure confidentiality, each participant needs a Treatment Assignment Code (TAC).  A central blinded data monitoring group has the corresponding Treatment Assignment Description (TAD) that describes whether each code corresponds to treatment or control.  The TAC-TAD codings are securely stored and any modifications to treatment must also be done to control to maintain the blinding.  Data monitoring, assessment, implementation, and analysis is be done by an independent blinded body different from those who collected or entered the data, for example by overview committees, steering committees, data monitoring cmomittees, central monitors, and on-site monitors \cite{baigent2008ensuring}.

\subsection{Reporting Plan}

\section{Policy Matters}
The policies of a trial make sure that the stakeholders, including academic researchers, government regulators, industry vendors, and the public, are appropriately represented in the conduct of the trial.  Policies should ensure fairness and compatibility between the desires of each entity involved in the trial.  To do this, the researchers should clearly specify details of the \emph{Collaboration and Conflicts of Interest} and the \emph{Roles and Obligations} of those involved.

\subsection{Collaboration and Conflicts of Interest}
Any industry sponsor, such as a pharmaceutical company, has a vested interest in seeing its product work.  For the trial to have credibility, the design, data collection, and statistical analyses must be conducted by groups independent of industry sponsors \cite{friedman2010fundamentals}.  This is facilitated by specifying the parameters of the collaboration in both confidentiality and Cooperative Research and Development Agreements (CRADA)\footnote{\url{http://ctep.cancer.gov/industryCollaborations2/docs/ECT_CRADA.doc}} or a Clinical Trials Agreement.\footnote{\url{http://ctep.cancer.gov/industryCollaborations2/docs/cta-model08_01.doc}}  The CRADA defines the important terms of the collaboration, the obligations of collaborators to adhere to research code (e.g., formally amending a protocol if deviations occur), and an intellectual property option.\footnote{\url{http://ctep.cancer.gov/industryCollaborations2/docs/Intellectual_Property_Option_to_Collaborators.doc}}  Intellectual property rights are protected by ensuring that the contents of the product remain confidential and are only used for the purposes specified in the study protocol.  The CRADA must specify the rights to data and materials from the study, which should be made available to all members of the CRADA \cite{national1999sharing}.

\subsection{Roles and Obligations}
Biomedical research involves collaboration between physicians, academics, pharmaceutical companies, and government agencies.\footnote{\url{http://ctep.cancer.gov/industryCollaborations2/default.htm#guidelines_for_collaborations}}  Physicians are required to perform the primary care and implementation of the research.  Academics design and oversee the scientific aspects of the research, such as creation of the experimental design, construction of the sampling frame, sampling procedures, randomization procedures, blinding procedures, manuscript preparation and reporting requirements.  Pharmaceutical companies develop and provide the drug under investigation.  Government agencies oversee and coordinate the other three groups, providing rigorous quality control including monitoring and evaluation of the trial, and must be convinced that the research being undertaken is safe and effective.

\begin{table}[ph]
\caption{Implementation Checklist}
\label{tab:implementation}
\begin{tabular}{p{12cm} |c| |c|}
Item & Yes & No \\ \hline
\underline{Recruitment, Retention, and Adherence Plan} & & \\
Has the RRAP been constructed? & & \\
Do recruitment procedures follow best practices? & & \\
Have the recruitment documents been pretested? & & \\
Have the recruiters been adequately trained? & & \\
Have multiple retention strategies and themes been used? & & \\
& & \\
\underline{Data and Safety Monitoring Plan} & & \\
Is all participant information secure and confidential? & & \\
Has the study protocol undergone peer review? & & \\
Have the staff been trained to avoid procedural and recording errors? & & \\
Has a central statistical monitoring body been implemented? & & \\
Are case report forms simple, usable, completed, and electronic? & & \\
Have data audits been conducted? & & \\
& & \\
\underline{Reporting} & & \\
Have the authors endorsed and followed the CONSORT statement? & & \\
Are data made freely available for secondary analyses? & & \\
Are statistical analyses reproducible? & & \\ \hline
\end{tabular}
\end{table}

\begin{table}[ph]
\caption{Policy Checklist}
\label{tab:policy}
\begin{tabular}{p{12cm} |c| |c|}
Item & Yes & No \\ \hline
\underline{Collaboration and Conflicts of Interest} & & \\
Have conflicts of interest been identified, minimized, and reported? & & \\
Have property rights been protected? & & \\
& & \\
\underline{Roles and Obligations} & & \\ 
Have the roles and obligations of the utility been specified? & & \\
Have the roles and obligations of the university been specified? & & \\
Have the roles and obligations of the vendor been specified? & & \\
Have the roles and obligations of the regulatory bodies been specified? & & \\ \hline
\end{tabular}
\end{table}

The literature review found a number of plausible mechanisms by which the in-home display might affect consumption.  These included: 1) changing habits, 2) increasing self-efficacy \cite{bandura2001social,thogersen2010electricity,norman2002design}, 3) allowing goal setting, 4) mapping behaviors to consumption \cite{mcclelland1979energy,yun2009investigating,eiden2009investigation}, 5) encouraging energy efficient appliance purchases \cite{seligman1978behavioral,dobson1992conservation,yun2009investigating}, 6) allowing control and maintenance of appliances \cite{fischer2008feedback,seligman1978behavioral,yun2009investigating}, 7) encouraging sustainable behaviors \cite{yun2009investigating,paetz2011shifting}, 8) promoting awareness of energy consumption, 9) enjoyment from playing with the device \cite{yun2009investigating,paetz2011shifting,hutton1986effects}, and 10) rational net-benefit calculations \cite{mckenzie2011fostering,roberts2004consumer,steg2009encouraging}.

The utility monitored whether each participant's smart meter successfully communicated with the in-home display.  If communication failed for more than one week, the participant was contacted by phone by a trained and blinded staff member, who inquired why the display was not active using a scripted questionnaire and then offered technical assistance in reactivating it.  If the participant refused to continue participation in the study, they were asked whether they could continue to be monitored and whether the collected data could be used.  Retention logs recorded the onset, duration, reason for going off-line, and refusal reasons for every participant.  

A data and safety monitoring board conducted monthly performance evaluations over the one-year period to ensure study protocol adherence and data quality.  Each evaluation assessed whether recruitment rates were met, whether the recruitment protocol was adhered to, whether retention rates were low, and whether new strategies for retention and recruitment needed to be implemented.  The utility company provided updated information on whether participants changed residences.

Any changes to the trial protocol were accompanied by: 1) a cover letter requesting the changes and explaining their rationale, 2) a copy of the revised operating procedures with version information including date, and 3) a tracked-changes version of the operating procedures, with version information including date.  Once approved, modifications to the protocol were recorded as documented amendments with an effective version date footer.  Any failure to adhere to the protocol specified in the manual of operations, by participants, investigators, or other staff, was treated as a protocol deviation and documented using the protocol deviation form.
